Ruby: Parse Excel 95-2003 files? - ruby

Is there a way to read Excel 97-2003 files from Ruby?
Background
I'm currently using the Ruby Gem parseexcel -- http://raa.ruby-lang.org/project/parseexcel/
But it is an old port of the perl module. It works fine, but the latest format it parses is Excel 95. And guess what? Excel 2007 will not produce the Excel 95 format.
John McNamara has taken over duties as the maintainer for the Perl Excel parser, see http://metacpan.org/pod/Spreadsheet::ParseExcel The current version will parse Excel 95-2003 files. But is there a port to Ruby?
My other thought is to build some Ruby to Perl glue code to enable use of the Perl library itself from Ruby. Eg, see What's the best way to export UTF8 data into Excel?
(I think it would be much faster to write the glue code than to port the parser.)
Thanks,
Larry

I'm using spreadsheet, give it a shot.

There is also roo:
http://roo.rubyforge.org/

In my experience spreadsheet works much faster than roo; however, roo supports the .xlsx format, which spreadsheet does not.

As khell mentioned, spreadsheet is a great tool. See my code below that I used to build a crawler.
require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
count = 0
Find.find('/Users/toor/crawler/') do |file| # iterate over every file below the given directory
  if file =~ /\.xls\z/ # only process files whose name ends in .xls
    worksheets = Spreadsheet.open(file).worksheets # all worksheets of the Excel workbook
    worksheets.each do |worksheet| # iterate over each worksheet
      worksheet.each do |row| # iterate over each row of the worksheet
        if row.to_s =~ /regex/ # a row must be converted to a string before it can be matched
          puts file
          count += 1
        end
      end
    end
  end
end
puts "#{count} pieces of information were found"
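The extension check can also be written without a regex. A minimal sketch using only the standard library; the helper name `xls_files` is made up for illustration, and the check is case-insensitive, which is an assumption about what you want:

```ruby
require 'find'

# Hypothetical helper: collect the paths of all .xls files below a directory.
def xls_files(root)
  Find.find(root).select do |path|
    # File.extname returns ".xlsx" for .xlsx files, so those are not matched.
    File.file?(path) && File.extname(path).casecmp?('.xls')
  end
end
```

The returned paths can then be passed to Spreadsheet.open one by one.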

I've not tried to parse Excel files before, but I know FasterCSV is a great library for parsing CSV files (which Excel can produce).

If you are on Windows, you can always use WIN32OLE.
Have a look at http://rubyonwindows.blogspot.com/search/label/excel

Related

Find out if CSV file contains empty field in Ruby?

Using Ruby 1.9.3, I want to read in a CSV file with headers and scan every field to see whether it is empty, as in foo,,bar,foofoo,barbar (the second field).
My approach is as follows:
require 'csv'
# read the CSV file line by line
CSV.foreach(filename, headers: true) do |row|
  # loop through each element within the current row
  for i in (0..row.length-1)
    # check for empty fields
    if !row[i]
      puts "empty field"
    end
  end
end
Well, this works, but when processing a file with ~18 million fields it is quite slow, and I have many such files. Are there any faster and more elegant ways to do this?
Using grep
Edit: Having my big file around, I also tested Uri Agassi's approach of using grep to get the lines of the file with empty fields:
File.new(filename).grep(/(^,|,(,|$))/)
It's about 10 times faster. If you need access to the fields you can use CSV.parse:
require 'csv'
File.new("/tmp/big.csv").grep(/(^,|,(,|$))/).each do |row_string|
  CSV.parse(row_string) do |row|
    puts row[1]
  end
end
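Because the pattern works on raw text, it knows nothing about CSV quoting. The sketch below runs it over a few made-up lines; the last sample is flagged even though none of its fields is empty, because the quoted text contains two adjacent commas. Conversely, a field written as an empty quoted string ("") is not flagged.

```ruby
# Pattern from the answer: a leading comma, two adjacent commas,
# or a trailing comma.
EMPTY_FIELD = /(^,|,(,|$))/

# Made-up sample lines for illustration.
SAMPLES = [
  'foo,,bar',        # empty field in the middle
  ',foo,bar',        # empty first field
  'foo,bar,',        # empty last field
  'foo,bar,baz',     # no empty field
  'foo,"a,,b",bar'   # no empty field, but the quoted text contains ",,"
].freeze

FLAGGED = SAMPLES.grep(EMPTY_FIELD)
puts FLAGGED
```

If your data can contain quoted commas, treat the grep result as a list of candidates and confirm each line with CSV.parse.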
Using a native CSV parser
Otherwise, if you have to parse the whole CSV file anyway, the answer is most likely no. Try running your script without the checking part - just reading the CSV rows. You will see no change in running time. This is because most of the time is spent reading and parsing the CSV file.
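One way to check this claim on your own machine is to time both variants. A rough sketch with the standard Benchmark module; the sample data and the helper names `sample_csv` and `compare_timings` are made up, and the actual numbers will depend on your Ruby version and data:

```ruby
require 'benchmark'
require 'csv'

# Made-up sample data: `rows` lines of 50 fields, every 10th field left empty.
def sample_csv(rows)
  line = (1..50).map { |i| (i % 10).zero? ? '' : "v#{i}" }.join(',')
  ([line] * rows).join("\n")
end

# Returns the seconds spent on parsing alone and on parsing plus the nil
# check, together with the number of empty fields that were found.
def compare_timings(data)
  parse_only = Benchmark.realtime { CSV.parse(data) { |_row| } }
  empty = 0
  parse_and_check = Benchmark.realtime do
    CSV.parse(data) { |row| empty += row.count(&:nil?) }
  end
  { parse_only: parse_only, parse_and_check: parse_and_check, empty: empty }
end

result = compare_timings(sample_csv(2_000))
puts format('parse only:      %.3fs', result[:parse_only])
puts format('parse and check: %.3fs', result[:parse_and_check])
puts "empty fields found: #{result[:empty]}"
```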
You might wonder if there is a faster CSV library for ruby. There is indeed a gem called FasterCSV but Ruby 1.9 has adopted it as its built-in CSV library, so it probably won't get much faster using Ruby only.
There is a ruby gem named excelsior which uses a native CSV parser. You can install it via gem install excelsior and use it like this:
require 'excelsior'
Excelsior::Reader.rows(File.open('/tmp/big.csv')) do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
I tested this code with a file like yours (72M, ~30k entries of 2.5k fields each) and it is about twice as fast. However, it segfaults after a few lines, so the gem might not be stable.
Using CSV
As you mentioned in your comment, there are a few more idiomatic ways to write this, such as using each instead of the for loop or using unless instead of if !, and using two spaces for indentation, which will turn it into:
require 'csv'
CSV.foreach('/tmp/big.csv') do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
This won't improve the speed though.
Parsing the CSVs could take a lot of your CPU. If all you want is to get the lines which contain an empty field (i.e. contain ,, start with a , or end with a ,), you can use grep on the raw lines of the files, without actually parsing them:
File.new(filename).grep(/(^,|,(,|$))/)
# => all the lines which have an empty field
I'm afraid that you still would go over all the files and read them, so it might not be as fast as you would hope, but unless there is some index on the files, I can't see a way around it.
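You can check all columns at once using Enumerable#any?. With headers: true each row is a CSV::Row, and iterating over it directly yields [header, value] pairs that are never nil, so call any? on row.fields:
CSV.foreach(filename, headers: true) do |row|
  puts "empty field" if row.fields.any?(&:nil?)
end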
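A small self-contained sketch of the same check on in-memory data. With headers: true each row is a CSV::Row, and iterating it directly yields [header, value] pairs, so the sketch looks at row.fields, which returns only the values. The helper name and the sample text are made up:

```ruby
require 'csv'

# Hypothetical helper: returns the indexes (0-based, header line not counted)
# of the data rows that contain at least one empty field.
def rows_with_empty_fields(csv_text)
  indexes = []
  CSV.parse(csv_text, headers: true).each_with_index do |row, index|
    # An unquoted empty field is parsed as nil.
    indexes << index if row.fields.any?(&:nil?)
  end
  indexes
end

sample = "a,b,c\n1,,3\n4,5,6\n,8,9\n"
p rows_with_empty_fields(sample)
```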
I think the grep solution will still be faster. Shelling out to the Linux grep command would probably be the fastest.
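Shelling out could look roughly like this. The sketch assumes a grep binary is available on the PATH; the helper name is made up, and the pattern is the same text-level check as before, so it shares the same blind spot for quoted commas:

```ruby
require 'open3'

# Hypothetical helper: returns the lines of `path` that contain an empty
# field, as reported by the external grep command.
def lines_with_empty_fields(path)
  # -E enables extended regular expressions. Passing the arguments as
  # separate strings bypasses the shell, so no quoting is needed.
  output, status = Open3.capture2('grep', '-E', '(^,|,,|,$)', '--', path)
  # grep exits with 0 when lines matched and 1 when nothing matched;
  # anything else indicates an error.
  raise "grep failed for #{path}" unless [0, 1].include?(status.exitstatus)
  output.lines.map(&:chomp)
end
```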

Editing a spreadsheet using SPREADSHEET ruby gem

I have to read data from a spreadsheet, modify some rows, and then write the updated rows / cells back into the same file.
I have used Spreadsheet gem with Ruby 2.0.0.
When I write the results back to the same file, I am unable to open the xls any more. I get an error
"File Format is not Valid"
in MS Excel.
When the updates are written onto a different file, I am able to open the file but it is in protected view. Is there a solution to this issue?
Below is the sample code:
require 'rubygems'
require 'spreadsheet'
book = Spreadsheet::open('filePath')
sheet = book.worksheet 0
## have application logic in here
book.write('filePath')
I've run into this problem a few times, and the maintainers have had the issue logged for around a year now.
The first problem is that spreadsheet locks the file when it loads it, and there is no clear way to close it. The only way I've been able to avoid the lock is with this code block, which opens the file, stores the first worksheet in its own variable, and then closes the file.
worksheet = nil
Spreadsheet.open workbook_name do |inner_book|
  worksheet = inner_book.worksheet 0
end
worksheet
If you want all the worksheets you could do something similar. In addition to the file opening/closing problem, you have the issue of capturing the content of the worksheet, depending on the format. For my purposes I end up doing the following to capture the content. This sadly loses any formatting you might have had in the source spreadsheet.
rows = []
worksheet.each do |row|
  rows << row
end
You can then make your own workbook/sheet and iterate through the rows and add them to the new sheet/book. Then save the new book with the same file name.
It's not fun or efficient, but it is a way to go about solving the problem. Hope this helped.
Check your file extension.
The spreadsheet and writeexcel gems (among others) do not seem to work with .xlsx files.
Try .xls, not .xlsx.

How to read excel values using queries in ruby?

I need to read values from an Excel sheet (.xls) with Ruby using a query. Are there any gems available in Ruby to do this? If so, please help me with this.
Any tips or advice on this would be great.
Thanks
Anto
You can use Sequel and OLEDB to read Excel Files:
require 'sequel'
Encoding.default_external = 'utf-8' # needed for umlauts in Excel
def read_excel(source)
  source = File.expand_path(source) # full path needed
  db = Sequel.ado(:conn_string=>"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=#{source};Extended Properties=Excel 8.0;")
  # Excel 2000 (for table names, use a dollar after the sheet name, e.g. Sheet1$)
  p db.test_connection
  dataset = db[:'Tabelle1$']
  p dataset
  dataset.each{|row|
    puts row
  }
end # read_excel
read_excel('my_spreadsheet.xls')
You need to know the name of the tab (in my example it's Tabelle1).
The 'real' solution here is not Sequel but the ADO interface. I'm not familiar with other ORMs, so I can't really help you there, but you could look at ActiveRecord, for example.
There are guides on connecting to MS Access or SQL Server via ADO, and some of them use ActiveRecord.
If you replace the connection string with the Excel string from my Sequel example, you may be able to use other ORMs.
You may also try to read the Excel data via an ODBC connection.
Read data from an Excel file using the spreadsheet gem:
require 'spreadsheet'
doc = Spreadsheet.open('simple.xls')
sheet = doc.worksheet(0) # sheet number; the first sheet is 0, and so on
val = sheet[r,c] # read a particular cell from sheet 0; r is the row, c is the column
Some information is there.
More information on the net, just use Google.

Microsoft Excel spreadsheet used as a computation engine called from code

I have a MS Excel spreadsheet which does some complex computations. I'd like to create a script which will create a CSV file with the results obtained from the spreadsheet.
I could rewrite the logic from the spreadsheet in my programming language (for example Ruby, but I'm open to using a different language), but then I would have to update my code whenever someone changes the logic in the spreadsheet. Is it possible to use an MS Excel spreadsheet as a black box, a computation engine, which can be invoked from my code? Then I would only have to write the CSV part and the input data download in my code; the whole computation logic could stay in the spreadsheet and could be easily updated.
Ideally, I don't want to add any CSV generation or data download code to the spreadsheet, because it's used by domain-experts (not programmers). Additionally, I have to download some data from the Internet and pass it to the spreadsheet as the input values. I'd like to keep that part of the code externally, in a version control system like Git. One additional note is that the spreadsheet uses the Solver Excel plugin.
Any help how to do that would be very appreciated.
Thanks,
Michal
To manipulate an Excel spreadsheet using Ruby, you may want to use win32ole.
Here's a sample script:
data = [["Hello", "World"]]
# Require the WIN32OLE library
require 'win32ole'
# Create an instance of the Excel application object
xl = WIN32OLE.new('Excel.Application')
# Make Excel visible
xl.Visible = 1
# Add a new Workbook object
wb = xl.Workbooks.Add
# Get the first Worksheet
ws = wb.Worksheets(1)
# Set the name of the worksheet tab
ws.Name = 'Sample Worksheet'
# For each row in the data set
data.each_with_index do |row, r|
  # For each field in the row
  row.each_with_index do |field, c|
    # Write the data to the Worksheet
    ws.Cells(r+1, c+1).Value = field.to_s
  end
end
# Save the workbook
wb.SaveAs('workbook.xls')
# Close the workbook
wb.Close
# Quit Excel
xl.Quit
To work out more complicated code, just record a macro of what you want to do, and then look at the code of your macro, and convert it from VB into Ruby.
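Applied to the question, the same approach might be sketched like this: open the existing workbook, write the input cells, let Excel recalculate, read the result cells, and write them to a CSV file. Everything specific here is an assumption: the helper names, the file names and the cell addresses are made up, the Excel part only runs on Windows with Excel installed, and running the Solver add-in is not covered.

```ruby
require 'csv'

# Hypothetical helper: `inputs` maps cell addresses to values, `outputs`
# lists the addresses to read back. `workbook_path` should be an absolute
# Windows path.
def compute_with_excel(workbook_path, inputs, outputs)
  require 'win32ole' # loaded here so the CSV part also runs on other systems
  xl = WIN32OLE.new('Excel.Application')
  xl.DisplayAlerts = false
  wb = xl.Workbooks.Open(workbook_path)
  ws = wb.Worksheets(1)
  inputs.each { |address, value| ws.Range(address).Value = value }
  xl.Calculate # recalculate all open workbooks
  outputs.map { |address| ws.Range(address).Value }
ensure
  wb&.Close(false) # false: leave the workbook file unchanged
  xl&.Quit
end

# Write a header line and one line of results; runs on any platform.
def write_results(csv_path, header, values)
  CSV.open(csv_path, 'w') do |csv|
    csv << header
    csv << values
  end
end

# Usage, with made-up names:
# values = compute_with_excel('C:\models\model.xls', { 'B1' => 42 }, %w[B10 B11])
# write_results('results.csv', %w[total margin], values)
```

This keeps the download and CSV code outside the spreadsheet, so the domain experts can keep editing the workbook as long as the input and output cells stay where they are.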

save/edit array in and outside ruby

I have an array like "author","post title","date","time","post category", etc.
I scrape the details from a forum and I want to:
save the data using Ruby
update the data using Ruby
update the data using a text editor, or perhaps one of the OpenOffice programs (Calc would be the best)
I guess some kind of SQL database would be a solution, but I need a quick solution (something that I can do by myself :-)
Any suggestions?
Thank you
YAML is your friend here.
require "yaml"
yaml= ["author","post title","date","time","post category"].to_yaml
File.open("filename", "w") do |f|
  f.write(yaml)
end
This will give you:
---
- author
- post title
- date
- time
- post category
Going the other way, you get the array back:
require "yaml"
YAML.load(File.read("filename")) # => ["author","post title","date","time","post category"]
YAML is easily human-readable, so you can edit it with any text editor (not a word processor like OpenOffice). You can serialize more than arrays and strings: YAML works out of the box for most Ruby objects, even for objects of user-defined classes. This is a good introduction to the YAML syntax: http://yaml.kwiki.org/?YamlInFiveMinutes.
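For scraped posts, a list of hashes is often easier to edit by hand than a flat array. A small sketch under that assumption; the helper names, keys and file name are made up:

```ruby
require 'yaml'

# Hypothetical helpers; each post is a Hash with string keys and values.
def save_posts(path, posts)
  File.write(path, posts.to_yaml)
end

def load_posts(path)
  YAML.load_file(path)
end

# Usage, with made-up data and file name:
# posts = [{ 'author' => 'alice', 'title' => 'First post', 'category' => 'news' }]
# save_posts('posts.yml', posts)
# ... edit posts.yml in a text editor if you like ...
# posts = load_posts('posts.yml')
# posts.first['title'] = 'Edited title'
# save_posts('posts.yml', posts)
```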
If you want to use a spreadsheet, csv is the way to go. You can use the stdlib csv api like:
require 'csv'
my2DArray = [[1,2],["foo","bar"]]
# CSV.open replaces the CSV::Writer API, which was removed in Ruby 1.9
CSV.open('data.csv', 'w') do |csv|
  my2DArray.each do |row|
    csv << row
  end
end
You can then open the resulting file in calc or in most statistics applications.
The same API can be used to re-import the result in ruby if you need.
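Reading the file back could look like the sketch below. The helper names are made up. Note that CSV stores everything as text, so without a converter the numbers come back as strings; the :numeric converter used here is one way around that:

```ruby
require 'csv'

# Hypothetical helpers for a two-dimensional array.
def save_table(path, rows)
  CSV.open(path, 'w') { |csv| rows.each { |row| csv << row } }
end

def load_table(path)
  # :numeric converts fields that look like numbers to Integer or Float;
  # all other fields stay Strings.
  CSV.read(path, converters: :numeric)
end

# Usage, with a made-up file name:
# save_table('data.csv', [[1, 2], ['foo', 'bar']])
# load_table('data.csv')
```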
You could serialize it to json and save it to a file. This would allow you to edit it using a simple text editor.
If you want to edit it in something like Calc, you could consider generating a CSV (comma-separated values) file and importing it.
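The JSON route needs only the standard library. A minimal sketch; the helper names are made up, and pretty_generate is used so that the file stays comfortable to edit by hand:

```ruby
require 'json'

# Hypothetical helpers: store any JSON-compatible structure
# (arrays, hashes, strings, numbers) in a file and read it back.
def save_json(path, data)
  File.write(path, JSON.pretty_generate(data))
end

def load_json(path)
  JSON.parse(File.read(path))
end

# Usage, with a made-up file name:
# save_json('posts.json', [['alice', 'First post', 'news']])
# load_json('posts.json')
```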
If I understand correctly, you have a two-dimensional array. You could output it in csv format like so:
array.each do |row|
  puts row.join(",")
end
Then you import it with Calc to edit it or just use a text editor.
If your data might contain commas, you should have a look at the csv module instead:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
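To see why the plain join breaks, compare it with what the csv library produces for a row whose values contain a comma and quotes. The sample values are made up:

```ruby
require 'csv'

ROW = ['Doe, Jane', 'say "hi"'].freeze

# Plain join: the comma inside the first value looks like a field separator.
NAIVE = ROW.join(',')

# CSV.generate_line quotes such values and doubles embedded quotes.
PROPER = CSV.generate_line(ROW)

puts NAIVE
puts PROPER

# Parsing the properly generated line gives the original two values back.
p CSV.parse_line(PROPER)
```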

Resources