Parsing through Excel using the Ruby gem "creek"

Hey guys, so I am trying to parse through an Excel file with the Ruby gem "creek". It parses the rows accurately, but I want to retrieve just the columns, such as only the data in the "A" column. The code below outputs the whole Excel document correctly.
require 'creek'
creek = Creek::Book.new 'Final.xlsx'
sheet= creek.sheets[0]
sheet.rows.each do |row|
  puts row # => {"A1"=>"Content 1", "B1"=>nil, "C1"=>nil, "D1"=>"Content 3"}
end
Any suggestions will be much appreciated.

Creek doesn't make it easy to extract column information because it stores the column and row smashed together in a string hash key.
The more popular Roo allows you to do things like sheet.column(1) and get an entire column. Very simple.
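For instance, a minimal Roo sketch along those lines, assuming the Final.xlsx from the question:
require 'roo'
xlsx = Roo::Spreadsheet.open('Final.xlsx')
xlsx.default_sheet = xlsx.sheets.first
puts xlsx.column(1).inspect # => every value in column "A", as an array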
If you absolutely must have creek, I noticed that there is an add-on to Creek called Ditch which adds some column-fetching capability. Example:
sheet.rows.each { |r|
puts "#{r.index} #{r.get('A')} - #{r.get('B')}"
}
Finally, if you want to do it with Creek and no add-ons, use Hash#select:
sheet.rows.each do |row|
  puts row.select { |k, v| ["A", "B"].include? k[0] }
end

To read individual columns you can use the Creek::Sheet#simple_rows method.
For example, to read the first and third columns:
require 'creek'
creek = Creek::Book.new 'Final.xlsx'
sheet_first = creek.sheets.first
# read first column A
col_first = sheet_first.simple_rows.map{|col| col['A']} #=> Array containing the first column
# read third column C
col_third = sheet_first.simple_rows.map{|col| col['C']} #=> Array containing the third column

Related

Ruby CSV converter, remove all converters?

I have some data I was writing from one CSV to another CSV because I need to do some data manipulation.
I noticed the CSV library has some default converters that are taking my values that look like dates and parsing those into new date strings.
I was wondering if I could remove all converters? I tried using my custom converter, but no matter what I do it seems that the dates keep getting parsed.
Here is my code simplified:
require 'csv'
CSV::Converters[:my_converter] = lambda do |value|
  value
end
CSV.open('new-data.csv', 'w') do |csv|
  data = CSV.read('original-data.csv', :converters => [:my_converter]).each do |row|
    csv << row
  end
end
The value 9/30/14 0:00 is getting changed to 9/30/2014 0:00, for example.
Are you sure that your CSV file doesn't actually contain the 4-digit year? Try looking at puts File.read('original-data.csv')
When I tried this on Ruby 2.1.8, it didn't change the value
require 'csv'
my_csv_data = 'hello,"9/30/14 0:00",world'
CSV.new(my_csv_data).each do |row|
puts row.inspect # prints ["hello", "9/30/14 0:00", "world"], as expected
end
CSV files are not parsed and converted into objects; the data in the fields is always returned as strings. This behavior is different from YAML or JSON, which do convert values back to their base types.
Consider this:
require 'csv'
CSV.parse("1,10/1/14,foo") # => [["1", "10/1/14", "foo"]]
All values are strings.
csv = ["foo", 'bar', 1, Date.new(2014, 10, 1)].to_csv # => "foo,bar,1,2014-10-01\n"
Converting an array containing native Ruby objects results in a string of comma-delimited values.
CSV.parse(csv) # => [["foo", "bar", "1", "2014-10-01"]]
Reparsing that string returns the string versions but doesn't attempt to return them to their original types as CSV doesn't have a way of knowing what those were. The developer (you) has to know and do that.
The end-result of all that is that CSV won't change a year from '14' to '2014'. It doesn't know that it's a date, and, because it's not CSV's place to convert to objects, it only splits the fields appropriately and passes the information on to be massaged by the developer.
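As a sketch of that last point, using the file names from the question: if you read without naming any converters (an empty :converters list makes that explicit), every field stays a raw string.
require 'csv'
CSV.open('new-data.csv', 'w') do |csv|
  CSV.read('original-data.csv', :converters => []).each do |row| # no converters applied
    csv << row # each field is written back exactly as it was read
  end
end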

Merging CSV tables with Ruby

I'm trying to join CSV files containing stock indexes with Ruby, and having a surprisingly hard time understanding the documentation out there. It's late, and I could use a friend, so go easy on me:
I have several files, with identical headers:
["Date", "Open", "High", "Low", "Close", "Volume"]
I would like my Ruby script to read each file's "Date" column and write a new CSV compiling an all-encompassing date range, from the earliest date to the latest.
Bonus:
Ideally, I would like to add all of the other column data ("Open", "High", etc.) into this new CSV file, alongside a column containing the source CSV's filename for reference.
Thanks for any consideration given to this. What I'd really like to do is sit down with a Ruby sensei to help me make sense of the documentation. How can I use CSV.read() or CSV.foreach() to build arrays/hashes I can work with?
(Theoretical and intelligent responses welcomed)
hypothetical:
CSV.read("data/DOW.csv") do |output|
puts output
end
returns:
[["Date", "Open", "High", "Low", "Close", "Volume"], ["2014-07-14", "71.35", "71.52", "70.82", "71.28", "823063.0"], ["2014-07-15", "71.32", "71.76", "71.0", "71.28", "813861.0"], ["2014-07-16", "71.34", "71.58", "70.68", "71.02", "843347.0"], ["2014-07-17", "70.54", "71.46", "70.54", "71.13", "1303839.0"], ["2014-07-18", "71.46", "72.95", "71.09", "72.46", "1375922.0"], ["2014-07-21", "72.21", "73.46", "71.88", "73.38", "1603854.0"], ["2014-07-22", "73.46", "74.76", "73.46", "74.57", "1335305.0"], ["2014-07-23", "74.54", "75.1", "73.77", "74.88", "1834953.0"]]
How can I identify rows, columns, etc? I'm looking for methods or ways to transform this array into hashes etc. Honestly, an overarching theoretical approach would suit my needs.
I've been playing with Ruby and CSV most of this day. I might be able to help (even though I am a beginner myself), but I don't understand what you want as output (a little example would help).
This example would load only columns "Date", "High" and "Volume" into "my_array".
my_array = []
CSV.foreach("data.csv") do |row|
my_array.push([row[0], row[2], row[5]])
end
If you want every column try:
my_array = []
CSV.foreach("data.csv") do |row|
my_array.push(row)
end
If you want to access element of array inside array:
puts my_array[0][0].inspect #=> "Date"
puts my_array[1][0].inspect #=> "2014-07-14"
When you finally get the output you want, you can save it to a file from the command prompt (on Windows or any other shell) with redirection:
ruby my_file.rb > output_in_text_form.txt
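If you'd rather pick columns by header name instead of position, here is a short sketch, assuming the data/DOW.csv path from your question:
require 'csv'
CSV.foreach('data/DOW.csv', headers: true) do |row| # each row is a CSV::Row keyed by the header
  puts "#{row['Date']} closed at #{row['Close']}"
end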
You can do something like this:
#!/usr/bin/env ruby
require 'csv'
input = ARGV.shift
output = ARGV.shift
File.open(output, 'w') do |o|
  csv_string = File.read(input)
  CSV.parse(csv_string).each do |r|
    # r is an array of columns. Do something with it.
    ...
    # Generate string version.
    new_csv_row = CSV.generate_line(r, {:force_quotes => true})
    # Write to file
    o.puts new_csv_row
  end
end
Using files is optional. You can use shell redirection and directly read from STDIN and/or directly write to STDOUT.
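For example, a sketch of the same loop reading STDIN and writing STDOUT, so it could be run as ruby filter.rb < input.csv > output.csv (file names hypothetical):
require 'csv'
CSV.parse($stdin.read).each do |r|
  # r is an array of columns; transform it as needed, then emit it again.
  $stdout.puts CSV.generate_line(r, :force_quotes => true)
end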

How do I make an array of arrays out of a CSV?

I have a CSV file that looks like this:
Jenny, jenny@example.com ,
Ricky, ricky@example.com ,
Josefina josefina@example.com ,
I'm trying to get this output:
users_array = [
['Jenny', 'jenny@example.com'], ['Ricky', 'ricky@example.com'], ['Josefina', 'josefina@example.com']
]
I've tried this:
users_array = Array.new
file = File.new('csv_file.csv', 'r')
file.each_line("\n") do |row|
  puts row + "\n"
  columns = row.split(",")
  users_array.push columns
  puts users_array
end
Unfortunately, in Terminal, this returns:
Jenny
jenny@example.com
Ricky
ricky@example.com
Josefina
josefina@example.com
Which I don't think will work for this:
users_array.each_with_index do |user|
  add_page.form_with(:id => 'new_user') do |f|
    f.field_with(:id => "user_email").value = user[0]
    f.field_with(:id => "user_name").value = user[1]
  end.click_button
end
What do I need to change? Or is there a better way to solve this problem?
Ruby's standard library has a CSV class with an API similar to File's, but with a number of useful methods for working with tabular data. To get the output you want, all you need to do is this:
require 'csv'
users_array = CSV.read('csv_file.csv')
PS - I think you are getting the output you expected with your file parsing as well, but maybe you're thrown off by how it is printing to the terminal. puts behaves differently with arrays, printing each member object on a new line instead of as a single array. If you want to view it as an array, use puts my_array.inspect.
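A quick illustration of that puts behaviour:
users_array = [['Jenny', 'jenny@example.com'], ['Ricky', 'ricky@example.com']]
puts users_array         # prints each element on its own line
puts users_array.inspect # prints [["Jenny", "jenny@example.com"], ["Ricky", "ricky@example.com"]]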
Assuming that your CSV file actually has a comma between the name and email address on the third line:
require 'csv'
users_array = []
CSV.foreach('csv_file.csv') do |row|
  users_array.push row.delete_if(&:nil?).map(&:strip)
end
users_array
# => [["Jenny", "jenny#example.com"],
# ["Ricky", "ricky#example.com"],
# ["Josefina", "josefina#example.com"]]
There may be a simpler way, but what I'm doing there is discarding the nil field created by the trailing comma and stripping the spaces around the email addresses.

How do I read the content of an Excel spreadsheet using Ruby?

I am trying to read an Excel spreadsheet file with Ruby, but it is not reading the content of the file.
This is my script
book = Spreadsheet.open 'myexcel.xls';
sheet1 = book.worksheet 0
sheet1.each do |row|
puts row.inspect ;
puts row.format 2;
puts row[1];
exit;
end
It is giving me the following:
[DEPRECATED] By requiring 'parseexcel', 'parseexcel/parseexcel' and/or
'parseexcel/parser' you are loading a Compatibility layer which
provides a drop-in replacement for the ParseExcel library. This
code makes the reading of Spreadsheet documents less efficient and
will be removed in Spreadsheet version 1.0.0
#<Spreadsheet::Excel::Row:0xffffffdbc3e0d2 @worksheet=#<Spreadsheet::Excel::Worksheet:0xb79b8fe0> @outline_level=0 @idx=0 @hidden=false @height= @default_format= @formats= []>
#<Spreadsheet::Format:0xb79bc8ac>
nil
I need to get the actual content of file. What am I doing wrong?
It looks like row, whose class is Spreadsheet::Excel::Row, is effectively an Excel Range, and it either includes Enumerable or at least exposes some enumerable behaviours (#each, for example).
So you might rewrite your script something like this:
require 'spreadsheet'
book = Spreadsheet.open('myexcel.xls')
sheet1 = book.worksheet('Sheet1') # can use an index or worksheet name
sheet1.each do |row|
  break if row[0].nil? # if first cell empty
  puts row.join(',') # looks like it calls "to_s" on each cell's Value
end
Note that I've parenthesised the arguments, which is generally advisable these days, and removed the semi-colons, which are not necessary unless you're writing multiple statements on a line (which you should rarely - if ever - do).
It's probably a hangover from a larger script, but I'll point out that in the code given the book and sheet1 variables aren't really needed, and that Spreadsheet#open takes a block, so a more idiomatic Ruby version might be something like this:
require 'spreadsheet'
Spreadsheet.open('MyTestSheet.xls') do |book|
  book.worksheet('Sheet1').each do |row|
    break if row[0].nil?
    puts row.join(',')
  end
end
I don't think you need to require parseexcel, just require 'spreadsheet'
Have you read the guide? It is super easy to follow.
Is it a one line file? If so you need:
puts row[0];

Parsing XLS and XLSX (MS Excel) files with Ruby?

Are there any gems able to parse XLS and XLSX files? I've found Spreadsheet and ParseExcel, but neither of them understands the XLSX format.
I recently needed to parse some Excel files with Ruby. The abundance of libraries and options turned out to be confusing, so I wrote a blog post about it.
The blog post includes a table of the different Ruby libraries and what formats they support, and, if you care about performance, a comparison of how the xlsx libraries stack up.
I have sample code to read xlsx files with each supported library here
Here are some examples for reading xlsx files with some different libraries:
rubyXL
require 'rubyXL'
workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
  puts "Reading: #{worksheet.sheet_name}"
  num_rows = 0
  worksheet.each do |row|
    row_cells = row.cells.map { |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
roo
require 'roo'
workbook = Roo::Spreadsheet.open './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
  puts "Reading: #{worksheet}"
  num_rows = 0
  workbook.sheet(worksheet).each_row_streaming do |row|
    row_cells = row.map { |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
creek
require 'creek'
workbook = Creek::Book.new './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.values
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
simple_xlsx_reader
require 'simple_xlsx_reader'
workbook = SimpleXlsxReader.open './sample_excel_files/xlsx_500000_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
Here is an example of reading a legacy xls file using the spreadsheet library:
spreadsheet
require 'spreadsheet'
# Note: spreadsheet only supports .xls files (not .xlsx)
workbook = Spreadsheet.open './sample_excel_files/xls_500_rows.xls'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.to_a.map { |v| v.methods.include?(:value) ? v.value : v }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end
Just found roo, that might do the job - works for my requirements, reading a basic spreadsheet.
The roo gem works great for Excel (.xls and .xlsx) and it's being actively developed.
I agree the syntax is not great nor ruby-like. But that can be easily achieved with something like:
class Spreadsheet
  def initialize(file_path)
    @xls = Roo::Spreadsheet.open(file_path)
  end

  def each_sheet
    @xls.sheets.each do |sheet|
      @xls.default_sheet = sheet
      yield sheet
    end
  end

  def each_row
    0.upto(@xls.last_row) do |index|
      yield @xls.row(index)
    end
  end

  def each_column
    0.upto(@xls.last_column) do |index|
      yield @xls.column(index)
    end
  end
end
I'm using creek which uses nokogiri. It is fast. Used 8.3 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is a hash of row and column name to cell content:
{"A1"=>"a cell", "B1"=>"another cell"}
The hash makes no guarantee that the keys will be in the original column order.
https://github.com/pythonicrubyist/creek
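For example, a small sketch of putting one row's cells back into column order (assuming single-letter columns), since the key order isn't guaranteed:
row = { "B1" => "another cell", "A1" => "a cell" }
ordered = row.sort_by { |key, _| key[0] }.map { |_, value| value }
puts ordered.inspect # => ["a cell", "another cell"]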
dullard is another great one that uses nokogiri. It is super fast. Used 6.7 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 2.0.0+. The output format for each row is an array:
["a cell", "another cell"]
https://github.com/thirtyseven/dullard
simple_xlsx_reader which has been mentioned is great, a bit slow. Used 91 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is an array:
["a cell", "another cell"]
https://github.com/woahdae/simple_xlsx_reader
Another interesting one is oxcelix. It uses ox's SAX parser, which is supposedly faster than both nokogiri's DOM and SAX parsers. It supposedly outputs a Matrix. I could not get it to work. Also, there were some dependency issues with rubyzip. Would not recommend it.
In conclusion, creek seems like a good choice. Other posts recommend simple_xlsx_parser as it has similar performance.
Removed dullard as recommended as it's outdated and people are getting errors/having problems with it.
If you're looking for more modern libraries, take a look at Spreadsheet: http://spreadsheet.rubyforge.org/GUIDE_txt.html.
I can't tell if it supports XLSX files, but considering that it is actively developed, I'm guessing it does (I'm not on Windows, or with Office, so I can't test).
At this point, it looks like roo is a good option again. It supports XLSX, allows (some) iteration by just using times with cell access. I admit, it's not pretty though.
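A sketch of that times-with-cell-access style of iteration in roo (file name hypothetical):
require 'roo'
xlsx = Roo::Excelx.new('sample.xlsx')
xlsx.last_row.times do |i|
  row = (1..xlsx.last_column).map { |j| xlsx.cell(i + 1, j) } # roo cells are 1-based
  puts row.inspect
end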
Also, RubyXL can now give you a sort of iteration using their extract_data method, which gives you a 2d array of data, which can be easily iterated over.
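A sketch of that extract_data approach, assuming a rubyXL version that still provides it (file name hypothetical):
require 'rubyXL'
workbook = RubyXL::Parser.parse('sample.xlsx')
workbook.worksheets.each do |worksheet|
  worksheet.extract_data.each do |row| # each row is a plain array of cell values
    puts row.inspect
  end
end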
Alternatively, if you're trying to work with XLSX files on Windows, you can use Ruby's Win32OLE library that allows you to interface with OLE objects, like the ones provided by Word and Excel. However, as @PanagiotisKanavos mentioned in the comments, this has a few major drawbacks:
Excel must be installed
A new Excel instance is started for each document
Memory and other resource consumption is far more than what is necessary for simple XLSX document manipulation.
But if you choose to use it, you can choose not to display Excel, load your XLSX file, and access it through it. I'm not sure if it supports iteration, however, I don't think it would be too hard to build around the supplied methods, as it is the full Microsoft OLE API for Excel.
Here's the documentation: http://support.microsoft.com/kb/222101
Here's the gem: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/win32ole/rdoc/WIN32OLE.html
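A rough sketch of that approach (Windows only, with Excel installed; the method names come from Excel's OLE automation API, and the path is hypothetical):
require 'win32ole'
excel = WIN32OLE.new('Excel.Application')
excel.Visible = false
workbook = excel.Workbooks.Open('C:\\Users\\me\\file.xlsx')
sheet = workbook.Worksheets(1)
rows = sheet.UsedRange.Rows.Count
cols = sheet.UsedRange.Columns.Count
(1..rows).each do |r|
  puts (1..cols).map { |c| sheet.Cells(r, c).Value }.inspect # print each row as an array
end
workbook.Close(false)
excel.Quit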
Again, the options don't look much better, but there isn't much else out there, I'm afraid. It's hard to parse a file format that is a black box, and those few who managed to break it didn't do it that visibly. Google Docs is closed source, and LibreOffice is thousands of lines of hairy C++.
I've been working heavily with both Spreadsheet and rubyXL these past couple weeks and I must say that both are great tools. However, one area where both suffer is the lack of examples of actually implementing anything useful. Currently I'm building a crawler and using rubyXL to parse xlsx files and Spreadsheet for anything xls. I hope the code below can serve as a helpful example and show just how effective these tools can be.
require 'find'
require 'rubyXL'
count = 0
Find.find('/Users/Anconia/crawler/') do |file| # begin iteration of each file of a specified directory
  if file =~ /\b.xlsx$\b/ # check if file is xlsx format
    workbook = RubyXL::Parser.parse(file).worksheets # creates an object containing all worksheets of an excel workbook
    workbook.each do |worksheet| # begin iteration over each worksheet
      data = worksheet.extract_data.to_s # extract data of a given worksheet - must be converted to a string in order to match a regex
      if data =~ /regex/
        puts file
        count += 1
      end
    end
  end
end
puts "#{count} files were found"
require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
count = 0
Find.find('/Users/Anconia/crawler/') do |file| # begin iteration of each file of a specified directory
  if file =~ /\b.xls$\b/ # check if a given file is xls format
    workbook = Spreadsheet.open(file).worksheets # creates an object containing all worksheets of an excel workbook
    workbook.each do |worksheet| # begin iteration over each worksheet
      worksheet.each do |row| # begin iteration over each row of a worksheet
        if row.to_s =~ /regex/ # rows must be converted to strings in order to match the regex
          puts file
          count += 1
        end
      end
    end
  end
end
puts "#{count} files were found"
The rubyXL gem parses XLSX files beautifully.
I couldn't find a satisfactory xlsx parser. RubyXL doesn't do date typecasting, Roo tried to typecast a number as a date, and both are a mess, in both API and code.
So, I wrote simple_xlsx_reader. You'd have to use something else for xls, though, so maybe it's not the full answer you're looking for.
Most of the online examples, including those on the author's website for the Spreadsheet gem, demonstrate reading the entire contents of an Excel file into RAM. That's fine if your spreadsheet is small.
xls = Spreadsheet.open(file_path)
For anyone working with very large files, a better way is to stream-read the contents of the file. The Spreadsheet gem supports this--albeit not well documented at this time (circa 3/2015).
Spreadsheet.open(file_path).worksheets.first.each do |row|
  # do something with the row's array of cell data
end
CITE: https://github.com/zdavatz/spreadsheet
The RemoteTable library uses roo internally. It makes it easy to read spreadsheets in different formats (XLS, XLSX, CSV, etc.), possibly remote, possibly stored inside a zip, gz, etc.:
require 'remote_table'
r = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/02data.zip', :filename => 'guide_jan28.xls'
r.each do |row|
puts row.inspect
end
Output:
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.0", "cyl"=>"6.0", "trans"=>"Auto(S4)", "drv"=>"R", "bidx"=>"60.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"20.0", "ucty"=>"19.1342", "uhwy"=>"30.2", "ucmb"=>"22.9121", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1238.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"2MODE", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.2", "cyl"=>"6.0", "trans"=>"Manual(M6)", "drv"=>"R", "bidx"=>"65.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"19.0", "ucty"=>"18.7", "uhwy"=>"30.4", "ucmb"=>"22.6171", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1302.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ASTON MARTIN", "carline name"=>"ASTON MARTIN VANQUISH", "displ"=>"5.9", "cyl"=>"12.0", "trans"=>"Auto(S6)", "drv"=>"R", "bidx"=>"1.0", "cty"=>"12.0", "hwy"=>"19.0", "cmb"=>"14.0", "ucty"=>"13.55", "uhwy"=>"24.7", "ucmb"=>"17.015", "fl"=>"P", "G"=>"G", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1651.0", "eng dscr"=>"GUZZLER", "trans dscr"=>"CLKUP", "vpc"=>"4.0", "cls"=>"1.0"}
