Find out if CSV file contains empty field in Ruby? - ruby

Using Ruby 1.9.3, I want to read in a CSV file with headers and check every field to see whether it is empty, as in foo,,bar,foofoo,barbar (the second field).
My approach is as follows:
require 'csv'
# read csv file line by line
CSV.foreach(filename, headers: true) do |row|
  # loop through each element within the current row
  for i in (0..row.length-1)
    # check for empty fields
    if !row[i]
      puts "empty field"
    end
  end
end
Well, this works, but when processing a file with ~18 million fields, this is quite slow, and I have many of them. Is there a faster and more elegant way to do this?

Using grep
Edit: Since I had my big file around, I also tested Uri Agassi's approach of using grep to get the lines of the file that have empty fields:
File.new(filename).grep(/(^,|,(,|$))/)
It's about 10 times faster. If you need access to the fields you can use CSV.parse:
require 'csv'
File.new("/tmp/big.csv").grep(/(^,|,(,|$))/).each do |row_string|
  CSV.parse(row_string) do |row|
    puts row[1]
  end
end
Using a native CSV parser
Otherwise, if you have to parse the whole CSV file anyway, the answer is most likely no. Try running your script without the checking part - just reading the CSV rows. You will see no change in running time. This is because most of the time is spent reading and parsing the CSV file.
You might wonder if there is a faster CSV library for ruby. There is indeed a gem called FasterCSV but Ruby 1.9 has adopted it as its built-in CSV library, so it probably won't get much faster using Ruby only.
There is a ruby gem named excelsior which uses a native CSV parser. You can install it via gem install excelsior and use it like this:
require 'excelsior'
Excelsior::Reader.rows(File.open('/tmp/big.csv')) do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
I tested this code with a file like yours (72M, ~30k entries à 2.5k fields) and it is about twice as fast. However, it segfaults after a few lines, so the gem might not be stable.
Using CSV
As you mentioned in your comment, there are a few more idiomatic ways to write this, such as using each instead of the for loop or using unless instead of if !, and using two spaces for indentation, which will turn it into:
require 'csv'
CSV.foreach('/tmp/big.csv') do |row|
  row.each do |column|
    unless column
      puts "empty field"
    end
  end
end
This won't improve the speed though.

Parsing the CSVs could take a lot of your CPU. If all you want is to get the lines which contain an empty field (i.e. contain ,, start with a , or end with a ,), you can use grep on the raw lines of the files, without actually parsing them:
File.new(filename).grep(/(^,|,(,|$))/)
# => all the lines which have an empty field
I'm afraid that you still would go over all the files and read them, so it might not be as fast as you would hope, but unless there is some index on the files, I can't see a way around it.
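The raw-line approach above could be sketched as a small helper that streams the file and reports line numbers. This is only an illustration: the method name lines_with_empty_fields is made up here, and it assumes the data has no quoted fields, because a quoted value such as "a,,b" would be reported as a false positive.

```ruby
# Matches a line that starts with a comma, ends with a comma, or contains ",,".
EMPTY_FIELD = /(^,|,(,|$))/

# Hypothetical helper: returns the 1-based numbers of lines that look like
# they contain an empty field. Assumes no quoted fields in the data.
def lines_with_empty_fields(path)
  hits = []
  # File.foreach reads one line at a time instead of loading the whole file.
  File.foreach(path).with_index(1) do |line, line_number|
    # chomp removes the line ending so a trailing comma is seen at end of line
    hits << line_number if line.chomp =~ EMPTY_FIELD
  end
  hits
end
```

Calling lines_with_empty_fields(filename) then gives the line numbers to inspect, and only those lines need to be handed to CSV.parse if the individual fields are required.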

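You can check all columns at once using Enumerable#any?:
CSV.foreach(filename, headers: true) do |row|
  # row.fields is the plain array of values; iterating the row itself yields [header, value] pairs
  puts "empty field" if row.fields.any?(&:nil?)
end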
I think the grep solution will still be faster. Shelling out to the Linux grep command would be the fastest.
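When the file is read with headers: true, each row is a CSV::Row, and enumerating it yields [header, value] pairs rather than bare values. A minimal sketch that uses this to report which columns are empty follows; the helper name empty_headers and the sample data are invented for illustration.

```ruby
require 'csv'

# Hypothetical helper: returns the header names whose value is missing in a row.
def empty_headers(row)
  row.headers.select { |header| row[header].nil? }
end

# Sample data made up for this example; unquoted empty fields are parsed as nil.
data = "name,city,zip\nfoo,,12345\nbar,Berlin,\nbaz,Paris,75001\n"

CSV.parse(data, headers: true).each_with_index do |row, i|
  missing = empty_headers(row)
  puts "row #{i + 1}: empty #{missing.join(', ')}" unless missing.empty?
end
```

This is no faster than the loop in the question, since parsing still dominates the running time, but it tells you where the gap is instead of only that one exists.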

Related

How to import a column of a CSV file into a Ruby array?

My goal is to import one column of a CSV file into a Ruby array. This is for a self-contained Ruby script, not an application. I'll just be running the script in Terminal and getting an output.
I'm having trouble finding the best way to import the file and finding the best way to dynamically insert the name of the file into that line of code. The filename will be different each time, and will be passed in by the user. I'm using $stdin.gets.chomp to ask the user for the filename, and setting it equal to file_name.
Can someone help me with this? Here's what I have for this part of the script:
require 'csv'
zip_array = CSV.read("path/to/file_name.csv")
and I need to be able to insert the proper file path above. Is this correct? And how do I get that path name in there? Maybe I'll need to totally re-structure my script, but any suggestions on how to do this?
There are two questions here, I think. The first is about getting user input from the command line. The usual way to do this is with ARGV. In your program you could do file_name = ARGV[0] so a user could type ruby your_program.rb path/to/file_name.csv on the command line.
The next is about reading CSVs. Using CSV.read will take the whole CSV, not just a single column. If you want to choose one column of many, you are likely better off doing:
zip_array = []
CSV.foreach(file_name) { |row| zip_array << row[whichever_column] }
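Putting the two parts together, a script along these lines might work. It is a sketch only: read_column is a hypothetical helper name, and the column index 0 is an assumption about where the zip codes live.

```ruby
require 'csv'

# Hypothetical helper: collects one column (zero-based index) from a CSV file.
def read_column(path, column_index)
  values = []
  # CSV.foreach yields one parsed row (an array of strings) at a time.
  CSV.foreach(path) { |row| values << row[column_index] }
  values
end

# Run as: ruby your_program.rb path/to/file_name.csv
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  zip_array = read_column(ARGV[0], 0) # assumes the first column is wanted
  puts zip_array.inspect
end
```

Keeping the file name in ARGV rather than prompting for it makes the script easier to rerun from shell history or to call from another script.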
Okay, first problem:
a) The file name will be different on each run (I'm supposing it will always be a CSV file, right?)
You can solve this problem by creating a folder, say input_data, next to your Ruby script. Then do:
Dir.glob('input_data/*.csv')
This will produce an array of ALL files inside that folder that end with CSV. If we assume there will be only 1 file at a time in that folder (with a different name), we can do:
file_name = Dir.glob('input_data/*.csv')[0]
This way you'll dynamically get the file path, no matter what the file is named. If the csv file is inside the same directory as your Ruby script, you can just do:
Dir.glob('*.csv')[0]
Now, for importing only 1 column into a Ruby array (let's suppose it's the first column):
require 'csv'
array = []
CSV.foreach(file_name) do |csv_row|
  array << csv_row[0] # [0] for the first column, [1] for the second etc.
end
What if your CSV file has headers? Suppose your column name is 'Total'. You can do:
require 'csv'
array = []
CSV.foreach(file_name, headers: true) do |csv_row|
  array << csv_row['Total']
end
Now it doesn't matter if your column is the 1st column, the 3rd etc, as long as it has a header named 'Total', Ruby will find it.
CSV.foreach reads your file line by line and is good for big files. CSV.read reads the whole file at once, but it lets you make your code more concise:
array = CSV.read(file_name, headers: true).map do |csv_row|
  csv_row['Total']
end
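When the file is read with headers: true, the result is a CSV::Table, and indexing the table with a header name returns that whole column. A small sketch, using CSV.parse on an invented in-memory string so that it runs without a file (CSV.read with a path returns the same kind of table):

```ruby
require 'csv'

# Sample data made up for this example.
csv_text = "Item,Total\napples,3\npears,5\n"

table = CSV.parse(csv_text, headers: true)
# Indexing a CSV::Table with a header name returns that column as an array.
totals = table['Total']
puts totals.inspect
```

The values come back as strings; convert them with to_i or to_f if you need numbers.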
Hope this helped.
First, you need to assign the returned value from $stdin.gets.chomp to a variable:
foo = $stdin.gets.chomp
Which will assign the entered input to foo.
You don't need to use $stdin though, as gets will use the standard input channel by default:
foo = gets.chomp
At that point use the variable as your read parameter:
zip_array = CSV.read(foo)
That's all basic coding and covered in any intro book for a language.

Using CSV Class to parse a .csv file in Ruby

I'm using Ruby 1.9.3 and I've discovered the CSV class, but I can't get it to work. Basically, I want to be able to manipulate the various options for the CSV, and then pull a .csv file into an array to work with, eventually pushing that array back out into a new file.
This is what I have currently:
require 'csv'
CSV_Definition = CSV.New(:header_converters => :symbol)
CSV_Total = CSV.Read(File.Path("C:\Scripts\SQL_Log_0.csv"))
However, I don't think this is the right way to change the :header_converters. Currently I can't get IRB working to parse these pieces of code (I'm not sure how to require 'csv' in IRB) so I don't have any particular error message. My expectations for this will be to create an array (CSV_Total) that has a header with no symbols in it. The next step is to put that array back into a new file. Basically it would scrub CSV files.
Ruby used to have its own built-in CSV library, which was replaced by FasterCSV as of version 1.9.
All you need to do is import the CSV class with a require 'csv' statement wherever you want to use it, and process the file accordingly. It's pretty easy to build an array with CSV.foreach, e.g.:
people.csv
Merry,Christmas
Hal,Apenyo
Terri,Aki
Willy,Byte
process_people.rb
require 'csv'
people = []
CSV.foreach(File.path("people.csv")) do |row|
  # Where row[i] corresponds to a zero-based value/column in the csv
  people << [row[0] + " " + row[1]]
end
puts people.to_s
=> [["Merry Christmas"], ["Hal Apenyo"], ["Terri Aki"], ["Willy Byte"]]
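For the header_converters part of the question, the option is passed to the reading method itself rather than to a separate CSV object, and the method names are lowercase (CSV.read, not CSV.Read). A rough sketch of a read-then-write-back pass is below; scrub_csv is a hypothetical name. Note also that a Windows path in a double-quoted string has its backslashes treated as escape sequences, so single quotes or forward slashes are safer.

```ruby
require 'csv'

# Hypothetical helper: reads a CSV with symbolized headers and writes the
# cleaned headers and the unchanged rows to a new file.
def scrub_csv(input_path, output_path)
  table = CSV.read(input_path, headers: true, header_converters: :symbol)
  CSV.open(output_path, 'w') do |csv|
    csv << table.headers.map(&:to_s)      # e.g. "First Name" becomes "first_name"
    table.each { |row| csv << row.fields } # write the values of every data row
  end
  table
end
```

In IRB, require 'csv' is typed at the prompt exactly as it appears in a script, after which scrub_csv('in.csv', 'out.csv') can be tried interactively (both file names are placeholders).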

How to compare data in two CSV files

I have two CSV files which have the same structure and ideally should have the same data.
I want to compare the data in them using Ruby and wanted to know if we already have a Ruby function for the same.
If you want to check whether files are identical you can simply use identical? which is an alias for compare_file:
require 'fileutils'
FileUtils.identical?('file1.csv', 'file2.csv')
If you want to see the differences you might want to use diffy:
gem install diffy
require 'diffy'
puts Diffy::Diff.new('file1.csv', 'file2.csv', :source => 'files')
It produces diff-like output which can be nicely formatted as HTML:
puts Diffy::Diff.new('file1.csv', 'file2.csv', :source => 'files').to_s(:html_simple)
As Summea commented, look at the CSV class.
Then use:
require 'csv'
# Will store each line of each file as an array of fields (so an array of arrays).
file1_lines = CSV.read("file1.csv")
file2_lines = CSV.read("file2.csv")
# The exclusive range (...) stops at the last valid index of file1_lines.
for i in 0...file1_lines.size
  if file1_lines[i] == file2_lines[i]
    puts "Same #{file1_lines[i]}"
  else
    puts "#{file1_lines[i]} != #{file2_lines[i]}"
  end
end
Note that using for in Ruby is quite rare. You normally iterate using an each on the collections, but there are two of them here.
Also, keep in mind that one of the lists may be longer than the other, but this should get you started.
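One way to handle files of different lengths is to iterate up to the size of the longer one, so that a missing row shows up as nil. The sketch below assumes both files fit in memory; differing_rows is a hypothetical helper, and the sample data is invented.

```ruby
require 'csv'

# Hypothetical helper: returns [row_index, row_from_first, row_from_second]
# for every position where the two parsed files differ.
def differing_rows(rows_a, rows_b)
  longest = [rows_a.size, rows_b.size].max
  (0...longest).each_with_object([]) do |i, diffs|
    # Indexing past the end of the shorter array gives nil, which never
    # equals a real row, so extra rows are reported as differences.
    diffs << [i, rows_a[i], rows_b[i]] unless rows_a[i] == rows_b[i]
  end
end

# Sample data made up for this example; CSV.read(path) returns the same shape.
a = CSV.parse("id,name\n1,foo\n2,bar\n")
b = CSV.parse("id,name\n1,foo\n2,baz\n3,qux\n")

differing_rows(a, b).each do |index, left, right|
  puts "row #{index}: #{left.inspect} != #{right.inspect}"
end
```

This compares rows by position, so it will report every following row as different if one file merely has an extra line inserted near the top.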

Ruby parse comma separated text file

I need some help with a Ruby script I can call from the console. The script needs to parse a simple .txt file with comma separated values.
value 1, value2, value3, etc...
The values need to be added to the database.
Any suggestions?
array = File.read("csv_file.txt").split(",").map(&:strip)
This gives you the values in an array, which you can then store in the database. If you need more features, you can use the FasterCSV gem.
Ruby 1.9.2 has a very good CSV library which is useful for this stuff: http://www.ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
On earlier versions of Ruby you could use http://fastercsv.rubyforge.org/ (which essentially became CSV in 1.9.2)
You could do it manually by reading the file into a string and using .split(',') but I'd go with one of the libraries above.
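The main reason to prefer a library over a manual split shows up as soon as a value contains a comma. A short comparison, with an input line invented for illustration:

```ruby
require 'csv'

# Sample line made up for this example: the second value contains a comma.
line = 'value 1,"value, with comma",value3'

# A plain split cuts the quoted value in two and keeps the quote characters.
naive = line.split(',').map(&:strip)

# CSV.parse_line understands the quoting and returns three fields.
parsed = CSV.parse_line(line).map { |value| value.to_s.strip }

puts naive.inspect
puts parsed.inspect
```

If the file is guaranteed never to contain quotes or embedded commas, the manual split is fine and avoids the parsing overhead.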
Quick and dirty solution:
result = []
File.open("<path-to-file>", "r") do |handle|
  handle.each_line do |line|
    # split returns an array, so strip each value individually
    result << line.split(",").map(&:strip)
  end
end # the file is closed automatically when the block ends
result.flatten!
result # => big array of values
Now you can iterate the result array and save the values to the database.
This simple file iteration doesn't take care of ordering or special fields, because they weren't mentioned in the question.
Something easy to get you started:
IO.readlines("csv_file.txt").each do |line|
  values = line.split(",").collect(&:strip)
  # do something with the values?
end
Hope this helps.

save/edit array in and outside ruby

I have an array like "author","post title","date","time","post category", etc.
I scrape the details from a forum and I want to:
save the data using Ruby
update the data using Ruby
update the data using a text editor, or maybe one of the OpenOffice programs? Calc would be the best.
I guess some kind of SQL database would be a solution, but I need a quick solution (something that I can do by myself :-)
any suggestions?
Thank you
YAML is your friend here.
require "yaml"
yaml= ["author","post title","date","time","post category"].to_yaml
File.open("filename", "w") do |f|
  f.write(yaml)
end
This will give you:
---
- author
- post title
- date
- time
- post category
To load it back, you do the reverse:
require "yaml"
YAML.load(File.read("filename")) # => ["author","post title","date","time","post category"]
YAML is easily human-readable, so you can edit it with any text editor (not a word processor like OpenOffice). You can serialize more than arrays and strings: YAML works out of the box for most Ruby objects, even objects of user-defined classes. This is a good introduction to the YAML syntax: http://yaml.kwiki.org/?YamlInFiveMinutes.
If you want to use a spreadsheet, csv is the way to go. You can use the stdlib csv api like:
require 'csv'
my2DArray = [[1,2],["foo","bar"]]
File.open('data.csv', 'w') do |outfile|
  CSV::Writer.generate(outfile) do |csv|
    my2DArray.each do |row|
      csv << row
    end
  end
end
You can then open the resulting file in calc or in most statistics applications.
The same API can be used to re-import the result in ruby if you need.
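CSV::Writer belongs to the CSV library that shipped with Ruby 1.8. On Ruby 1.9 and later, where FasterCSV became the standard csv library, the equivalent can be sketched with CSV.open. The helper name write_rows is made up for this example.

```ruby
require 'csv'

# Hypothetical helper: writes a two-dimensional array to a CSV file.
def write_rows(path, rows)
  # CSV.open in 'w' mode creates or truncates the file and closes it
  # when the block ends; values are quoted automatically when needed.
  CSV.open(path, 'w') do |csv|
    rows.each { |row| csv << row }
  end
end
```

Calling write_rows('data.csv', [[1, 2], ["foo", "bar"]]) produces a file that Calc can open, and CSV.read('data.csv') loads it back as an array of arrays of strings.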
You could serialize it to json and save it to a file. This would allow you to edit it using a simple text editor.
If you want to edit it in something like Calc, you could consider generating a CSV (comma-separated values) file and importing it.
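The JSON route could look roughly like this, using the json library from Ruby's standard distribution. The helper names save_json and load_json are invented for the example.

```ruby
require 'json'

# Hypothetical helper: writes any JSON-compatible Ruby data to a file.
def save_json(path, data)
  # pretty_generate adds line breaks and indentation, which makes the
  # file easier to edit by hand in a text editor.
  File.write(path, JSON.pretty_generate(data))
end

# Hypothetical helper: reads the data back into Ruby arrays and hashes.
def load_json(path)
  JSON.parse(File.read(path))
end
```

For example, save_json('posts.json', [["author", "post title", "date"]]) followed by load_json('posts.json') returns the same nested array; the file name is a placeholder.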
If I understand correctly, you have a two-dimensional array. You could output it in csv format like so:
array.each do |row|
  puts row.join(",")
end
Then you import it with Calc to edit it or just use a text editor.
If your data might contain commas, you should have a look at the csv module instead:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/index.html
