Using CSV Class to parse a .csv file in Ruby

I'm using Ruby 1.9.3 and I've discovered the CSV class, but I can't get it to work. Basically, I want to be able to manipulate the various options for the CSV, and then pull a .csv file into an array to work with, eventually pushing that array back out into a new file.
This is what I have currently:
require 'csv'
CSV_Definition = CSV.New(:header_converters => :symbol)
CSV_Total = CSV.Read(File.Path("C:\Scripts\SQL_Log_0.csv"))
However, I don't think this is the right way to change the :header_converters. Currently I can't get IRB to parse these pieces of code (I'm not sure how to require 'csv' in IRB), so I don't have any particular error message. My expectation is to end up with an array (CSV_Total) whose header has no symbols in it. The next step is to put that array back out into a new file. Basically it would scrub CSV files.

Ruby used to have its own built-in CSV library, which was replaced with FasterCSV as of version 1.9; see the documentation for details.
All that's required on your part is to import the CSV class via a require 'csv' statement wherever you want to use it, and process accordingly. It's pretty easy to build an array with the foreach method, e.g.:
people.csv
Merry,Christmas
Hal,Apenyo
Terri,Aki
Willy,Byte
process_people.rb
require 'csv'
people = []
CSV.foreach(File.path("people.csv")) do |row|
  # Where row[i] corresponds to a zero-based value/column in the csv
  people << [row[0] + " " + row[1]]
end
puts people.to_s
=> [["Merry Christmas"], ["Hal Apenyo"], ["Terri Aki"], ["Willy Byte"]]
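To address the :header_converters part of the original question: the method names are lowercase (CSV.new, CSV.read), and the converters are passed as options when reading. A minimal sketch, assuming the file has a header row; the paths and the scrubbing step are placeholders:
require 'csv'

# Read the file; :header_converters => :symbol turns a header like "Post Title" into :post_title
csv_total = CSV.read("C:/Scripts/SQL_Log_0.csv",
                     :headers => true,
                     :header_converters => :symbol)

# ...scrub the rows here...

# Write the table back out to a new file
CSV.open("C:/Scripts/SQL_Log_0_clean.csv", "w") do |csv|
  csv << csv_total.headers
  csv_total.each { |row| csv << row }
end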

Related

How to import a column of a CSV file into a Ruby array?

My goal is to import one column of a CSV file into a Ruby array. This is for a self-contained Ruby script, not an application. I'll just be running the script in Terminal and getting an output.
I'm having trouble finding the best way to import the file and to dynamically insert the name of the file into that line of code. The filename will be different each time, and will be passed in by the user. I'm using $stdin.gets.chomp to ask the user for the filename, and setting it equal to file_name.
Can someone help me with this? Here's what I have for this part of the script:
require 'csv'
zip_array = CSV.read("path/to/file_name.csv")
and I need to be able to insert the proper file path above. Is this correct? And how do I get that path name in there? Maybe I'll need to totally re-structure my script, but any suggestions on how to do this?
There are two questions here, I think. The first is about getting user input from the command line. The usual way to do this is with ARGV. In your program you could do file_name = ARGV[0] so a user could type ruby your_program.rb path/to/file_name.csv on the command line.
The next is about reading CSVs. Using CSV.read will take the whole CSV, not just a single column. If you want to choose one column of many, you are likely better off doing:
zip_array = []
CSV.foreach(file_name) { |row| zip_array << row[whichever_column] }
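Putting the two together, a minimal script might look like this (column index 0 is just an example; pick whichever column you need):
require 'csv'

# Usage: ruby your_program.rb path/to/file_name.csv
file_name = ARGV[0]
zip_array = []

CSV.foreach(file_name) do |row|
  zip_array << row[0] # collect the first column
end

puts zip_array.inspect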
Okay, first problem:
a) The file name will be different on each run (I'm supposing it will always be a CSV file, right?)
You can solve this problem by creating a folder, say input_data, in your Ruby script's directory. Then do:
Dir.glob('input_data/*.csv')
This will produce an array of ALL files inside that folder that end with .csv. If we assume there will be only one file at a time in that folder (with a different name each run), we can do:
file_name = Dir.glob('input_data/*.csv')[0]
This way you'll dynamically get the file path, no matter what the file is named. If the csv file is inside the same directory as your Ruby script, you can just do:
Dir.glob('*.csv')[0]
Now, for importing only 1 column into a Ruby array (let's suppose it's the first column):
require 'csv'
array = []
CSV.foreach(file_name) do |csv_row|
  array << csv_row[0] # [0] for the first column, [1] for the second, etc.
end
What if your CSV file has headers? Suppose your column name is 'Total'. You can do:
require 'csv'
array = []
CSV.foreach(file_name, headers: true) do |csv_row|
  array << csv_row['Total']
end
Now it doesn't matter whether your column is the 1st, the 3rd, etc.; as long as it has a header named 'Total', Ruby will find it.
CSV.foreach reads your file line by line and is good for big files. CSV.read reads the whole file at once, but using it you can make your code more concise:
array = CSV.read(file_name, headers: true).map do |csv_row|
  csv_row['Total']
end
Hope this helped.
First, you need to assign the returned value from $stdin.gets.chomp to a variable:
foo = $stdin.gets.chomp
Which will assign the entered input to foo.
You don't need to use $stdin though, as gets will use the standard input channel by default:
foo = gets.chomp
At that point use the variable as your read parameter:
zip_array = CSV.read(foo)
That's all basic coding and covered in any intro book for a language.

Find out if CSV file contains empty field in Ruby?

Using Ruby 1.9.3, I want to read in a CSV file with headers and scan every single field to see whether it is left empty and does not contain a value, as in foo,,bar,foofoo,barbar (the second field).
My approach is as follows:
require 'csv'
#read csv file line by line
CSV.foreach(filename,headers:true) do |row|
#loop through each element within the current row
for i in (0..row.length-1)
#check for empty fields
if !row[i]
puts "empty field"
end
end
end
Well, this works, but when processing a file with ~18 million fields it is quite slow, and I have many such files. Are there any faster and more elegant ways to do this?
Using grep
Edit: Having my big file around, I also tested Uri Agassi's approach using grep to get the lines of the file with empty fields:
File.new(filename).grep(/(^,|,(,|$))/)
It's about 10 times faster. If you need access to the fields you can use CSV.parse:
require 'csv'
File.new("/tmp/big.csv").grep(/(^,|,(,|$))/).each do |row_string|
  CSV.parse(row_string) do |row|
    puts row[1]
  end
end
Using a native CSV parser
Otherwise, if you have to parse the whole CSV file anyway, the answer is most likely no. Try running your script without the checking part - just reading the CSV rows. You will see no change in running time. This is because most of the time is spent reading and parsing the CSV file.
You might wonder if there is a faster CSV library for Ruby. There is indeed a gem called FasterCSV, but Ruby 1.9 has adopted it as its built-in CSV library, so it probably won't get much faster using Ruby alone.
There is a ruby gem named excelsior which uses a native CSV parser. You can install it via gem install excelsior and use it like this:
require 'excelsior'
Excelsior::Reader.rows(File.open('/tmp/big.csv')) do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
I tested this code with a file like yours (72 MB, ~30k rows with ~2.5k fields each) and it is about twice as fast; however, it segfaults after a few lines, so the gem might not be stable.
Using CSV
As you mentioned in your comment, there are a few more idiomatic ways to write this, such as using each instead of the for loop or using unless instead of if !, and using two spaces for indentation, which will turn it into:
require 'csv'
CSV.foreach('/tmp/big.csv') do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
This won't improve the speed though.
Parsing the CSVs could take a lot of your CPU. If all you want is to get the lines which contain an empty field (i.e. contain ,, start with a , or end with a ,), you can use grep on the raw lines of the files, without actually parsing them:
File.new(filename).grep(/(^,|,(,|$))/)
# => all the lines which have an empty field
I'm afraid you would still have to go over all the files and read them, so it might not be as fast as you would hope, but unless there is some index on the files, I can't see a way around it.
You can check all columns at once using Enumerable#any?
CSV.foreach(filename, headers: true) do |row|
  puts "empty field" if row.fields.any?(&:nil?)
end
I think the grep solution will still be faster. Shelling out to the Linux grep command would be the fastest.
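A rough sketch of shelling out from Ruby (assuming a Unix-like system with grep available; the filename and pattern mirror the earlier answer):
# Collect every raw line that has an empty field, without parsing the CSV
lines = `grep -E '(^,|,,|,$)' /tmp/big.csv`
lines.each_line { |line| puts "empty field in: #{line}" }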

Preserve key order loading YAML from a file in Ruby

I want to preserve the order of the keys in a YAML file loaded from disk, processed in some way and written back to disk.
Here is a basic example of loading YAML in Ruby (v1.8.7):
require 'yaml'
configuration = nil
File.open('configuration.yaml', 'r') do |file|
  configuration = YAML::load(file)
  # at this point configuration is a hash with keys in an undefined order
end
# process configuration in some way
File.open('output.yaml', 'w+') do |file|
  YAML::dump(configuration, file)
end
Unfortunately, this will destroy the order of the keys in configuration.yaml once the hash is built. I cannot find a way of controlling what data structure is used by YAML::load(), e.g. alib's orderedmap.
I've had no luck searching the web for a solution.
Use Ruby 1.9.x. Previous versions of Ruby do not preserve the order of Hash keys, but 1.9 does.
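For example, on 1.9 a simple round trip keeps the keys in file order (a minimal sketch; configuration.yaml is the file from the question):
require 'yaml'

# On Ruby 1.9+, Hashes remember insertion order, so the loaded
# keys come back in the order they appear in the file.
configuration = YAML.load_file('configuration.yaml')
puts configuration.keys.inspect

# Dumping writes the keys back out in that same order.
File.open('output.yaml', 'w') { |f| YAML.dump(configuration, f) }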
If you're stuck using 1.8.7 for whatever reason (like I am), I've resorted to using active_support/ordered_hash. I know activesupport seems like a big include, but they've refactored it in later versions to where you pretty much only require the part you need in the file and the rest gets left out. Just gem install activesupport, and include it as shown below. Also, in your YAML file, be sure to use an !!omap declaration (and an array of Hashes). Example time!
# config.yml #
months: !!omap
- january: enero
- february: febrero
- march: marzo
- april: abril
- may: mayo
Here's what the Ruby behind it looks like.
# loader.rb #
require 'yaml'
require 'active_support/ordered_hash'
# Load up the file into a Hash
config = File.open('config.yml','r') { |f| YAML::load f }
# So long as you specified an !!omap, this is actually a
# YAML::PrivateClass, an array of Hashes
puts config['months'].class
# Parse through its value attribute, stick results in an OrderedHash,
# and reassign it to our hash
ordered = ActiveSupport::OrderedHash.new
config['months'].value.each { |m| ordered[m.keys.first] = m.values.first }
config['months'] = ordered
I'm looking for a solution that allows me to recursively dig through a Hash loaded from a .yml file, look for those YAML::PrivateClass objects, and convert them into an ActiveSupport::OrderedHash. I may post a question on that.
Someone came up with the same issue. There is a gem for ordered hashes; note that it is not a plain Hash, it creates a subclass of Hash. You might give it a try, but if you run into problems dealing with YAML, you should consider upgrading to Ruby 1.9.

How do I parse columns in a CSV File in Ruby and store it as an array?

I have the following CSV file:
Date,Av,Sec,128,440,1024,Mixed,,rule,sn,version
6/30/2010,3.40,343,352.0,1245.8,3471.1,650.7,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 342,-0.26%,-0.91%,1.51%,-0.97%
6/24/2010,3.40,342,352.9,1257.2,3419.5,657.1,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 341,0.23%,0.50%,-1.34%,0.67%
6/17/2010,3.40,341,352.1,1251.0,3466.1,652.7,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 340,7.77%,5.32%,9.04%,1.71%
6/14/2010,3.40,340,326.7,1187.8,3178.7,641.7,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 339,-0.88%,-0.34%,-0.95%,0.05%
6/11/2010,3.40,339,329.6,1191.9,3209.2,641.4,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 338,0.58%,0.51%,-1.83%,0.99%
6/11/2010,3.40,338,327.7,1185.8,3269.1,635.1,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 335,-0.40%,-0.44%,1.46%,-1.96%
6/11/2010,3.40,335,329.0,1191.0,3221.9,647.8,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 333,-6.83%,-4.70%,-7.04%,-0.32%
6/11/2010,3.40,333,353.1,1249.8,3465.8,649.9,Mbps,on,s-2.8.6-38,4.9.1-229,,vs. 332,2.53%,2.02%,1.71%,2.14%
and I want to parse columns 4, 5, 6 and 7 and have four arrays, on which I can do operations like creating a line graph against time, etc.
You need the CSV module, which ships with Ruby. Example:
require 'csv'
require 'pp'
CSV.foreach('bar.csv') do |row|
  pp row[3..6] # zero-based indexes 3..6 are the 4th through 7th columns
end
Why reinvent the wheel?
Use the fastercsv gem or the built-in csv library.
You are spoiled for choice in parsing CSVs with Ruby, as there are options included in the standard lib, as well as easy home-brewed methods or Open Source libs.
You can start with the examples on "How to parse CSV data with Ruby", and that should point you in the direction for digging deeper.
You can use the smarter_csv Ruby gem and use its :key_mapping option to ignore unwanted input columns.
See:
https://github.com/tilo/smarter_csv

save/edit array in and outside ruby

I have an array like "author", "post title", "date", "time", "post category", etc.
I scrape the details from a forum and I want to
save the data using Ruby
update the data using Ruby
update the data using a text editor or one of the OpenOffice programs; Calc would be best
I guess some kind of SQL database would be a solution, but I need a quick solution for this (something that I can do by myself :-)
Any suggestions?
Thank you
YAML is your friend here.
require "yaml"
yaml = ["author","post title","date","time","post category"].to_yaml
File.open("filename", "w") do |f|
  f.write(yaml)
end
This will give you:
---
- author
- post title
- date
- time
- post category
And vice versa, you get the array back:
require "yaml"
YAML.load(File.read("filename")) # => ["author","post title","date","time","post category"]
YAML is easily human-readable, so you can edit it with any text editor (not a word processor like OpenOffice). You can serialize not only arrays and strings: YAML works out of the box for most Ruby objects, even objects of user-defined classes. This is a good introduction to the YAML syntax: http://yaml.kwiki.org/?YamlInFiveMinutes.
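For instance, an object of a custom class round-trips as well (a quick sketch; the Post class is made up for illustration):
require 'yaml'

# A hypothetical class, just to show that plain Ruby objects serialize too
class Post
  attr_accessor :author, :title
  def initialize(author, title)
    @author = author
    @title = title
  end
end

yaml = Post.new("Merry", "Christmas").to_yaml
restored = YAML::load(yaml)
puts restored.author # => "Merry"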
If you want to use a spreadsheet, CSV is the way to go. You can use the stdlib CSV API like this:
require 'csv'

my_2d_array = [[1, 2], ["foo", "bar"]]

# CSV.open writes each appended row as a properly quoted CSV line
CSV.open('data.csv', 'w') do |csv|
  my_2d_array.each do |row|
    csv << row
  end
end
You can then open the resulting file in calc or in most statistics applications.
The same API can be used to re-import the result into Ruby if you need to.
You could serialize it to JSON and save it to a file. This would allow you to edit it using a simple text editor.
If you want to edit it in something like Calc, you could consider generating a CSV (comma-separated values) file and importing it.
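A quick sketch of the JSON route (using the stdlib json library; the filename posts.json is just an example):
require 'json'

data = ["author", "post title", "date", "time", "post category"]

# Write the array out as JSON text that any editor can open
File.open("posts.json", "w") { |f| f.write(JSON.pretty_generate(data)) }

# ...edit the file by hand, then load it back in
data = JSON.parse(File.read("posts.json"))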
If I understand correctly, you have a two-dimensional array. You could output it in CSV format like so:
array.each do |row|
  puts row.join(",")
end
Then you import it with Calc to edit it or just use a text editor.
If your data might contain commas, you should have a look at the csv module instead:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
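For example, the csv library quotes fields that contain commas, where a plain join would produce an ambiguous row (a small sketch):
require 'csv'

row = ["Hal", "likes commas, apparently"]

puts row.join(",")          # => Hal,likes commas, apparently   (looks like three fields)
puts CSV.generate_line(row) # => Hal,"likes commas, apparently"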
