How do I parse columns in a CSV File in Ruby and store it as an array? - ruby

I have the following CSV file:
Date,Av,Sec,128,440,1024,Mixed,,rule,sn,version
6/30/2010,3.40,343,352.0,1245.8,3471.1,650.7,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 342,-0.26%,-0.91%,1.51%,-0.97%
6/24/2010,3.40,342,352.9,1257.2,3419.5,657.1,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 341,0.23%,0.50%,-1.34%,0.67%
6/17/2010,3.40,341,352.1,1251.0,3466.1,652.7,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 340,7.77%,5.32%,9.04%,1.71%
6/14/2010,3.40,340,326.7,1187.8,3178.7,641.7,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 339,-0.88%,-0.34%,-0.95%,0.05%
6/11/2010,3.40,339,329.6,1191.9,3209.2,641.4,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 338,0.58%,0.51%,-1.83%,0.99%
6/11/2010,3.40,338,327.7,1185.8,3269.1,635.1,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 335,-0.40%,-0.44%,1.46%,-1.96%
6/11/2010,3.40,335,329.0,1191.0,3221.9,647.8,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 333,-6.83%,-4.70%,-7.04%,-0.32%
6/11/2010,3.40,333,353.1,1249.8,3465.8,649.9,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 332,2.53%,2.02%,1.71%,2.14%
and I want to parse columns 4, 5, 6 and 7 and have four arrays, on which I can do operations like create a line graph against time, etc.

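You need the Ruby CSV library, which ships with Ruby. The CSV::Reader class existed only in Ruby 1.8; on Ruby 1.9 and later, use CSV.foreach. Example:
require 'csv'
require 'pp'
# Row indices start at 0, so columns 4 to 7 (counting from 1) are row[3..6]
CSV.foreach('bar.csv') do |row|
  pp row[3..6]
end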
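To end up with four separate arrays rather than printed rows, one option is to collect each column across all rows. This is only a sketch: extract_columns is a hypothetical helper name, the sample data is a shortened copy of the file from the question, and it assumes the first line is a header and the chosen columns are numeric.

```ruby
require 'csv'

# Hypothetical helper: returns one array of Floats per requested column index.
# Assumes the first line is a header row and the selected fields are numeric.
def extract_columns(csv_text, indices)
  data = CSV.parse(csv_text).drop(1) # skip the header row
  indices.map { |i| data.map { |row| Float(row[i]) } }
end

sample = <<~EOS
  Date,Av,Sec,128,440,1024,Mixed
  6/30/2010,3.40,343,352.0,1245.8,3471.1,650.7
  6/24/2010,3.40,342,352.9,1257.2,3419.5,657.1
EOS

# Indices 3..6 are columns 4 to 7 when counting from 1
c128, c440, c1024, mixed = extract_columns(sample, 3..6)
p c128  # => [352.0, 352.9]
p mixed # => [650.7, 657.1]
```

For a real file you could pass File.read('bar.csv') as the first argument.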

Why reinvent the wheel?
Use the fastercsv gem or the built-in csv library.

You are spoiled for choice in parsing CSVs with Ruby, as there are options included in the standard lib, as well as easy home-brewed methods or Open Source libs.
You can start with the examples on "How to parse CSV data with Ruby", and that should point you in the direction for digging deeper.

You can use the smarter_csv Ruby gem with a :key_mapping option to ignore unwanted input columns.
See:
https://github.com/tilo/smarter_csv

Related

Find out if CSV file contains empty field in Ruby?

Using Ruby 1.9.3, I want to read in a CSV file with headers and check every single field to see whether it has been left empty, like the second field in foo,,bar,foofoo,barbar.
My approach is as follows:
require 'csv'
# read csv file line by line
CSV.foreach(filename, headers: true) do |row|
  # loop through each element within the current row
  for i in (0..row.length-1)
    # check for empty fields
    if !row[i]
      puts "empty field"
    end
  end
end
Well, this works, but when processing a file with ~18 million fields it is quite slow, and I have many such files. Is there a faster and more elegant way to do this?
Using grep
Edit: Having my big file around, I also tested Uri Agassi's approach of using grep to get the lines of the file with empty fields:
File.new(filename).grep(/(^,|,(,|$))/)
It's about 10 times faster. If you need access to the fields you can use CSV.parse:
require 'csv'
File.new("/tmp/big.csv").grep(/(^,|,(,|$))/).each do |row_string|
  CSV.parse(row_string) do |row|
    puts row[1]
  end
end
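To see what the pattern picks up, here is a small sketch run against an in-memory string instead of a file. The constant and method names are made up for illustration. Note that the regex works on raw text, so it would also flag a quoted field that happens to contain two consecutive commas.

```ruby
require 'stringio'

# Matches a line that starts with a comma, contains ",,", or ends with a comma
EMPTY_FIELD = /(^,|,(,|$))/

# Hypothetical helper: returns the raw lines that contain an empty field
def lines_with_empty_fields(io)
  io.each_line.grep(EMPTY_FIELD)
end

io = StringIO.new("a,b,c\nfoo,,bar\n,x,y\np,q,\n")
puts lines_with_empty_fields(io)
# prints:
# foo,,bar
# ,x,y
# p,q,
```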
Using a native CSV parser
Otherwise, if you have to parse the whole CSV file anyway, the answer is most likely no. Try running your script without the checking part - just reading the CSV rows. You will see no change in running time. This is because most of the time is spent reading and parsing the CSV file.
You might wonder if there is a faster CSV library for ruby. There is indeed a gem called FasterCSV but Ruby 1.9 has adopted it as its built-in CSV library, so it probably won't get much faster using Ruby only.
There is a ruby gem named excelsior which uses a native CSV parser. You can install it via gem install excelsior and use it like this:
require 'excelsior'
Excelsior::Reader.rows(File.open('/tmp/big.csv')) do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
I tested this code with a file like yours (72M, ~30k entries à 2.5k fields) and it is about twice as fast. However, it segfaults after a few lines, so the gem might not be stable.
Using CSV
As you mentioned in your comment, there are a few more idiomatic ways to write this, such as using each instead of the for loop or using unless instead of if !, and using two spaces for indentation, which will turn it into:
require 'csv'
CSV.foreach('/tmp/big.csv') do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
This won't improve the speed though.
Parsing the CSVs could take a lot of your CPU. If all you want is to get the lines which contain an empty field (i.e. contain ,, start with a , or end with a ,), you can use grep on the raw lines of the files, without actually parsing them:
File.new(filename).grep(/(^,|,(,|$))/)
# => all the lines which have an empty field
I'm afraid that you still would go over all the files and read them, so it might not be as fast as you would hope, but unless there is some index on the files, I can't see a way around it.
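You can check all columns at once using Enumerable#any?. With headers: true each row is a CSV::Row, whose each yields [header, field] pairs, so call any? on row.fields:
CSV.foreach(filename, headers: true) do |row|
  puts "empty field" if row.fields.any?(&:nil?)
end
I think the grep solution will still be faster. Shelling out to the Linux grep command would probably be the fastest.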
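If you also need to know which fields are empty, a sketch along these lines reports the header names per line. The name empty_field_report is hypothetical, and the line numbers assume a header on line 1 and no multi-line quoted fields.

```ruby
require 'csv'

# Hypothetical helper: returns [line_number, [headers of empty fields]] pairs.
# Assumes line 1 is the header row and no field spans multiple lines.
def empty_field_report(csv_text)
  report = []
  CSV.parse(csv_text, headers: true).each_with_index do |row, i|
    missing = row.headers.select { |header| row[header].nil? }
    report << [i + 2, missing] unless missing.empty?
  end
  report
end

p empty_field_report("a,b,c\n1,,3\n4,5,6\n7,8,\n")
# => [[2, ["b"]], [4, ["c"]]]
```

This parses the whole input, so it will not be faster than the grep approach; it only adds detail.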

Using CSV Class to parse a .csv file in Ruby

I'm using Ruby 1.9.3 and I've discovered the CSV class, but I can't get it to work. Basically, I want to be able to manipulate the various options for the CSV, and then pull a .csv file into an array to work with, eventually pushing that array back out into a new file.
This is what I have currently:
require 'csv'
CSV_Definition = CSV.New(:header_converters => :symbol)
CSV_Total = CSV.Read(File.Path("C:\Scripts\SQL_Log_0.csv"))
However, I don't think this is the right way to change the :header_converters. Currently I can't get IRB working to parse these pieces of code (I'm not sure how to require 'csv' in IRB) so I don't have any particular error message. My expectations for this will be to create an array (CSV_Total) that has a header with no symbols in it. The next step is to put that array back into a new file. Basically it would scrub CSV files.
Ruby used to have its own built-in CSV library, which was replaced with FasterCSV as of version 1.9; see the CSV documentation for details.
All that's required on your part is to import the CSV class with a require 'csv' statement wherever you want to use it, and process the data accordingly. It's pretty easy to build an array with the foreach method, e.g.:
people.csv
Merry,Christmas
Hal,Apenyo
Terri,Aki
Willy,Byte
process_people.rb
require 'csv'
people = []
CSV.foreach(File.path("people.csv")) do |row|
  # Where row[i] corresponds to a zero-based value/column in the csv
  people << [row[0] + " " + row[1]]
end
puts people.to_s
=> [["Merry Christmas"], ["Hal Apenyo"], ["Terri Aki"], ["Willy Byte"]]
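For the header part of the question, the options go straight to the reading method rather than to a separate CSV.new call. The following is a minimal sketch, not a definitive solution: scrub_csv and the file names are made up, and it assumes the file has a header row. It reads the file with symbolized headers and writes the result to a new file.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: reads src with symbolized headers and writes the
# rows, with the cleaned-up header names, to dest. Returns the new headers.
def scrub_csv(src, dest)
  table = CSV.read(src, headers: true, header_converters: :symbol)
  CSV.open(dest, 'w') do |out|
    out << table.headers.map(&:to_s)
    table.each { |row| out << row.fields }
  end
  table.headers
end

# Demonstration with a temporary directory instead of a real log file
Dir.mktmpdir do |dir|
  src  = File.join(dir, 'in.csv')
  dest = File.join(dir, 'out.csv')
  File.write(src, "First Name,Last Name\nHal,Apenyo\n")
  p scrub_csv(src, dest) # => [:first_name, :last_name]
  puts File.read(dest)
end
```

On Windows, write the path with forward slashes or in single quotes, since backslashes inside double-quoted strings are treated as escape sequences.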

Ruby parse comma separated text file

I need some help with a Ruby script I can call from the console. The script needs to parse a simple .txt file with comma separated values.
value 1, value2, value3, etc...
The values need to be added to the database.
Any suggestions?
array = File.read("csv_file.txt").split(",").map(&:strip)
This gives you the values in an array, which you can then store in the database. If you need more functionality, you can use the FasterCSV gem.
Ruby 1.9.2 has a very good CSV library which is useful for this stuff: http://www.ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
On earlier versions of Ruby you could use http://fastercsv.rubyforge.org/ (which essentially became CSV in 1.9)
You could do it manually by reading the file into a string and using .split(',') but I'd go with one of the libraries above.
Quick and dirty solution:
result = []
File.open("<path-to-file>","r") do |handle|
  handle.each_line do |line|
    result << line.split(",").map(&:strip)
  end
end # the file is closed automatically when the block exits
result.flatten!
result # => big array of values
Now you can iterate over the result array and save the values to the database.
This simple file iteration doesn't take care of ordering or special fields, because they weren't mentioned in the question.
Something easy to get you started:
IO.readlines("csv_file.txt").each do |line|
  values = line.split(",").collect(&:strip)
  # do something with the values?
end
Hope this helps.
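For comparison, here is roughly what the library route could look like. It is a sketch with a made-up method name, parse_values, and it assumes the values are not quoted, because the CSV parser rejects a quote that follows a space after the comma.

```ruby
require 'csv'

# Hypothetical helper: parses comma separated text into a flat array of
# stripped values. Assumes unquoted values; empty fields are dropped.
def parse_values(text)
  CSV.parse(text).flatten.compact.map(&:strip)
end

p parse_values("value 1, value2, value3\nvalue4, value5\n")
# => ["value 1", "value2", "value3", "value4", "value5"]
```

Each element of the returned array can then be passed to whatever database layer you use.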

Ruby: Parse Excel 95-2003 files?

Is there a way to read Excel 97-2003 files from Ruby?
Background
I'm currently using the Ruby Gem parseexcel -- http://raa.ruby-lang.org/project/parseexcel/
But it is an old port of the Perl module. It works fine, but the latest format it parses is Excel 95. And guess what? Excel 2007 will not produce the Excel 95 format.
John McNamara has taken over duties as the maintainer for the Perl Excel parser, see http://metacpan.org/pod/Spreadsheet::ParseExcel The current version will parse Excel 95-2003 files. But is there a port to Ruby?
My other thought is to build some Ruby to Perl glue code to enable use of the Perl library itself from Ruby. Eg, see What's the best way to export UTF8 data into Excel?
(I think it would be much faster to write the glue code than to port the parser.)
Thanks,
Larry
I'm using spreadsheet, give it a shot.
There is also roo:
http://roo.rubyforge.org/
In my experience spreadsheet works much faster than roo, however roo can support the .xlsx format which spreadsheet cannot.
As khell mentioned, spreadsheet is a great tool. See my code below that I used to build a crawler.
require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
count = 0
Find.find('/Users/toor/crawler/') do |file| # iterate over each file in the specified directory
  if file =~ /\.xls\z/ # check whether the file has an .xls extension
    workbook = Spreadsheet.open(file).worksheets # array of all worksheets in the workbook
    workbook.each do |worksheet| # iterate over each worksheet
      worksheet.each do |row| # iterate over each row of the worksheet
        if row.to_s =~ /regex/ # rows must be converted to strings in order to match the regex
          puts file
          count += 1
        end
      end
    end
  end
end
puts "#{count} pieces of information were found"
I've not tried to parse Excel files before, but I know FasterCSV is a great library for parsing CSV files (which Excel can produce).
If you are on Windows, you can always use WIN32OLE.
Have a look at http://rubyonwindows.blogspot.com/search/label/excel

save/edit array in and outside ruby

I have an array like "author","post title","date","time","post category", etc.
I scrape the details from a forum and I want to
save the data using ruby
update the data using ruby
update the data using text editor or I was thinking of one of OpenOffice programs? Calc would be the best.
I guess some kind of SQL database would be a solution, but I need a quick solution (something that I can do by myself :-)
any suggestions?
Thank you
YAML is your friend here.
require "yaml"
yaml= ["author","post title","date","time","post category"].to_yaml
File.open("filename", "w") do |f|
f.write(yaml)
end
This will give you
---
- author
- post title
- date
- time
- post category
To read it back:
require "yaml"
YAML.load(File.read("filename")) # => ["author","post title","date","time","post category"]
YAML is easily human-readable, so you can edit it with any text editor (not a word processor like OpenOffice). You are not limited to serializing arrays and strings: YAML works out of the box for most Ruby objects, even for objects of user-defined classes. This is a good introduction to the YAML syntax: http://yaml.kwiki.org/?YamlInFiveMinutes.
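The "update the data using ruby" part could be handled by loading the file, changing the array, and writing it back. This is a minimal sketch under a few assumptions: append_post and the file name are hypothetical, and each post is stored as a plain array of strings.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical helper: loads the posts stored at path (or starts with an
# empty list), appends one post, and writes everything back as YAML.
def append_post(path, post)
  posts = File.exist?(path) ? (YAML.load(File.read(path)) || []) : []
  posts << post
  File.write(path, posts.to_yaml)
  posts
end

# Demonstration in a temporary directory
Dir.mktmpdir do |dir|
  path = File.join(dir, 'posts.yml')
  append_post(path, ["alice", "first post", "news"])
  append_post(path, ["bob", "second post", "help"])
  puts File.read(path)
end
```

Between two runs you can edit the file by hand in a text editor, as long as it stays valid YAML.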
If you want to use a spreadsheet, CSV is the way to go. You can use the stdlib csv API (CSV.open on Ruby 1.9 and later, which replaced the 1.8-only CSV::Writer) like this:
require 'csv'
my2DArray = [[1,2],["foo","bar"]]
CSV.open('data.csv', 'w') do |csv|
  my2DArray.each do |row|
    csv << row
  end
end
You can then open the resulting file in calc or in most statistics applications.
The same API can be used to re-import the result in ruby if you need.
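A round trip might look like the sketch below (round_trip is a made-up name). One thing to keep in mind is that everything comes back as a string, so numbers have to be converted again after reading.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: writes rows to path as CSV and reads them back
def round_trip(rows, path)
  CSV.open(path, 'w') { |csv| rows.each { |row| csv << row } }
  CSV.read(path)
end

Dir.mktmpdir do |dir|
  p round_trip([[1, 2], ["foo", "bar"]], File.join(dir, 'data.csv'))
  # => [["1", "2"], ["foo", "bar"]]
end
```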
You could serialize it to json and save it to a file. This would allow you to edit it using a simple text editor.
If you want to edit it in something like Calc, you could consider generating a CSV (comma-separated values) file and importing it.
If I understand correctly, you have a two-dimensional array. You could output it in csv format like so:
array.each do |row|
puts row.join(",")
end
Then you import it with Calc to edit it or just use a text editor.
If your data might contain commas, you should have a look at the csv module instead:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
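To illustrate the comma problem, the sketch below compares a plain join with Array#to_csv, which becomes available after require 'csv'. The two method names and the sample row are made up for the example.

```ruby
require 'csv'

# Hypothetical helpers: two ways of turning one row into a line of text
def naive_line(row)
  row.join(",")
end

def csv_line(row)
  row.to_csv # quotes fields that contain commas and appends a newline
end

row = ["Smith, John", "post title", "news"]
puts naive_line(row)                    # Smith, John,post title,news
puts csv_line(row)                      # "Smith, John",post title,news
p CSV.parse_line(naive_line(row)).size  # => 4, the author was split in two
p CSV.parse_line(csv_line(row)) == row  # => true
```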
