Removing whitespaces in a CSV file - ruby

I have a string with extra whitespace:
First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
I want to parse this line and remove the whitespace.
My code looks like:
namespace :db do
  task :populate_contacts_csv => :environment do
    require 'csv'
    csv_text = File.read('file_upload_example.csv')
    csv = CSV.parse(csv_text, :headers => true)
    csv.each do |row|
      puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
    end
  end
end

@prices = CSV.parse(IO.read('prices.csv'), :headers => true,
  :header_converters => lambda { |f| f.strip },
  :converters => lambda { |f| f ? f.strip : nil })
The nil test is added to the field converter but not the header converter, on the assumption that headers are never nil while the data might be, and nil doesn't have a strip method. I'm really surprised that, AFAIK, :strip is not a pre-defined converter!
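If you want something closer to a built-in :strip converter, you can register one yourself and refer to it by name. A minimal sketch (the :strip name here is our own choice, not part of the library):

require 'csv'

# Register custom converters under a name of our choosing (:strip is not built in).
CSV::Converters[:strip] = ->(field) { field.is_a?(String) ? field.strip : field }
CSV::HeaderConverters[:strip] = ->(header) { header.strip }

@prices = CSV.parse(IO.read('prices.csv'), :headers => true,
                    :header_converters => :strip,
                    :converters => :strip)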

You can strip your hash first:
csv.each do |unstriped_row|
  row = {}
  unstriped_row.each { |k, v| row[k.strip] = v.strip }
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Edited to strip hash keys too

CSV supports "converters" for the headers and fields, which let you get inside the data before it's passed to your each loop.
Writing a sample CSV file:
csv = "First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
first,last,email ,mobile phone ,company,title ,street,city,state,zip,country, birthday,gender ,contact type
"
File.write('file_upload_example.csv', csv)
Here's how I'd do it:
require 'csv'

csv = CSV.open('file_upload_example.csv', :headers => true)
[:convert, :header_convert].each { |c| csv.send(c) { |f| f.strip } }

csv.each do |row|
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Which outputs:
First Name: 'first'
Last Name: 'last'
Email: 'email'
The converters simply strip leading and trailing whitespace from each header and each field as they're read from the file.
Also, as a programming design choice, don't read your file into memory using:
csv_text = File.read('file_upload_example.csv')
Then parse it:
csv = CSV.parse(csv_text, :headers => true)
Then loop over it:
csv.each do |row|
Ruby's IO system supports enumerating over a file, line by line. Once my code does CSV.open, the file is open and each reads one line at a time, so the whole file never has to sit in memory at once. Slurping a file doesn't scale (though on new machines it's becoming a lot more reasonable), and, if you test, you'll find that reading a file with each is extremely fast, probably just as fast as reading it, parsing it, and then iterating over the parsed result.
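For example, a line-by-line version of the same read, with stripping converters attached, might look like this (a sketch using CSV.foreach, which never loads the whole file):

require 'csv'

CSV.foreach('file_upload_example.csv',
            :headers => true,
            :header_converters => lambda { |h| h.strip },
            :converters => lambda { |f| f.is_a?(String) ? f.strip : f }) do |row|
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end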

Related

Change Headers for Certain Columns in CSV File

I have a CSV file that I want to change the headers only for certain columns (about 20 of them in my actual file). Here's a sample CSV file:
CSV File
"name","blah_01_blah","foo_1_01_foo","bacon_01_bacon","bacon_02_bacon"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
I have been trying this with a File/IO class, but when it reads the file to do the gsub it removes all of the quotation marks around each string separated by commas. Here's the code I'm using:
Ruby Code
file = 'file.csv'

replacements = {
  'blah_01_blah'   => 'newblah1',
  'foo_01_foo'     => 'coolfoo1',
  'bacon_01_bacon' => 'goodpig1',
  'bacon_02_bacon' => 'goodpig2'
}

matcher = /#{replacements.keys.join('|')}/
outdata = File.read(file).gsub(matcher, replacements)

File.open(file, 'w') do |out|
  out << outdata
end
What I end up with is this in the CSV file:
New CSV File
name,blah_01_blah,foo_1_01_foo,bacon_01_bacon,bacon_02_bacon
John,yucky,summer,yum,food
Mary,"","",cool,sundae
It's keeping the quotation marks in fields that are blank, but taking them out around the strings elsewhere. I want to retain those quotation marks in case for some reason a rogue comma ends up in a string somewhere so it doesn't get thrown off. How can I change the headers without losing my quotation marks around the strings?
EDIT - This is what I want the file to look like at the end.
Expected Result CSV File
"name","newblah1","coolfoo1","goodpig1","goodpig2"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
Thanks!
You don’t need to handle CSV at all:
File.write(
  file,
  File.readlines(file).tap do |lines|
    lines.first.gsub!(matcher, replacements)
  end.join
)
See File#readlines.
The trick here is that we only touch the first line, treating it as plain text.
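Spelled out step by step, the same idea (using the file, matcher and replacements from the question) is just:

lines = File.readlines(file)              # every line keeps its trailing newline
lines.first.gsub!(matcher, replacements)  # rewrite only the header line
File.write(file, lines.join)              # write everything else back untouched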
Let's first create the input CSV file.
text =<<_
"name","blah_01_blah","foo_1_01_foo","bacon_01_bacon","bacon_02_bacon"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
_
file_in = 'file_in.csv'
file_out = 'file_out.csv'
File.write(file_in, text)
#=> 137
Here is the replacements hash, which I simplified slightly.
replacements = {'blah_01_blah'=>'newblah1', 'foo_01_foo'=>'coolfoo1',
'bacon_01_bacon'=>'goodpig1'}
The first task is to modify this hash so that if it has no key k, replacements[k] will return k. For this we use the method Hash#default_proc=.
replacements.default_proc = ->(_,k) { k }
Here are two examples of how this hash is used.
replacements['bacon_01_bacon']
#=> "goodpig1"
replacements['name']
#=> "name"`
The latter follows because replacements has no key 'name'.
The code is as follows.
require 'csv'

f_in = CSV.read(file_in, headers: true)

CSV.open(file_out, 'w') do |csv_out|
  csv_out << replacements.values_at(*f_in.headers)
  f_in.each { |row| csv_out << row }
end
#=> #<CSV::Table mode:col_or_row row_count:3>
Note that
f_in.headers
#=> ["name", "blah_01_blah", "foo_1_01_foo", "bacon_01_bacon", "bacon_02_bacon"]
Let's look at the output file.
puts File.read(file_out)
prints
name,newblah1,foo_1_01_foo,goodpig1,bacon_02_bacon
John,yucky,summer,yum,food
Mary,"","",cool,sundae

Working with large CSV files in Ruby

I want to parse two CSV files of the MaxMind GeoIP2 database, do some joining based on a column and merge the result into one output file.
I used the standard Ruby CSV library, but it is very slow. I think it tries to load the whole file into memory.
block_file = File.read(block_path)
block_csv = CSV.parse(block_file, :headers => true)
location_file = File.read(location_path)
location_csv = CSV.parse(location_file, :headers => true)

CSV.open(output_path, "wb",
         :write_headers => true,
         :headers => ["geoname_id", "Y", "Z"]) do |csv|
  block_csv.each do |block_row|
    puts "#{block_row['geoname_id']}"
    location_csv.each do |location_row|
      if (block_row['geoname_id'] === location_row['geoname_id'])
        puts " match :"
        csv << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
        break location_row
      end
    end
  end
end
Is there another Ruby library that supports processing in chunks?
block_csv is 800MB and location_csv is 100MB.
Just use CSV.open(block_path, 'r', :headers => true).each do |line| instead of File.read and CSV.parse. It will parse the file line by line.
In your current version, you explicitly tell it to read the whole file with File.read and then to parse that whole string with CSV.parse, so it does exactly what you told it to do.
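A sketch of how that could look for the join above, assuming the smaller location file fits comfortably in memory as a lookup table while the 800 MB block file is streamed row by row (column names are taken from the question):

require 'csv'

# Index the smaller file by geoname_id so each block row is a hash lookup,
# not a scan over the whole location table.
locations = {}
CSV.open(location_path, 'r', :headers => true).each do |location_row|
  locations[location_row['geoname_id']] = location_row
end

CSV.open(output_path, "wb",
         :write_headers => true,
         :headers => ["geoname_id", "Y", "Z"]) do |csv|
  # Stream the big file one row at a time instead of File.read + CSV.parse.
  CSV.open(block_path, 'r', :headers => true).each do |block_row|
    next unless locations.key?(block_row['geoname_id'])
    csv << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
  end
end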

How to remove a row from a CSV with Ruby

Given the following CSV file, how would you remove all rows that contain the word 'true' in the column 'foo'?
Date,foo,bar
2014/10/31,true,derp
2014/10/31,false,derp
I have a working solution; however, it requires making a secondary CSV object, @csv_no_foo:
@csv = CSV.read(@csvfile, headers: true) # http://bit.ly/1mSlqfA
@headers = CSV.open(@csvfile, 'r', :headers => true).read.headers

# Make a new CSV
@csv_no_foo = CSV.new(@headers)

@csv.each do |row|
  # puts row[5]
  if row[@headersHash['foo']] == 'false'
    @csv_no_foo.add_row(row)
  else
    puts "not pushing row #{row}"
  end
end
Ideally, I would just remove the offending row from the CSV like so:
...
if row[@headersHash['foo']] == 'false'
  @csv.delete(true) # Doesn't work
...
Looking at the Ruby documentation, it looks like the row class has a delete_if function, but I'm confused about the syntax that function requires. Is there a way to remove the row without making a new CSV object?
http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV/Row.html#method-i-each
You should be able to use CSV::Table#delete_if, but you need to use CSV::table instead of CSV::read, because the former will give you a CSV::Table object, whereas the latter results in an Array of Arrays. Be aware that this setting will also convert the headers to symbols.
table = CSV.table(@csvfile)

table.delete_if do |row|
  row[:foo] == 'true'
end

File.open(@csvfile, 'w') do |f|
  f.write(table.to_csv)
end
You might want to filter rows in a Ruby manner:
require 'csv'

csv = CSV.parse(File.read(@csvfile), {
  :col_sep => ",",
  :headers => true
}).select { |item| item['foo'] != 'true' }
Hope it helps.
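If you then want to write the surviving rows back out, a sketch (using the same @csvfile and the headers from the sample file) could be:

require 'csv'

kept_rows = CSV.parse(File.read(@csvfile), :headers => true)
               .reject { |row| row['foo'] == 'true' }

CSV.open(@csvfile, 'w', :write_headers => true,
         :headers => ['Date', 'foo', 'bar']) do |csv|
  kept_rows.each { |row| csv << row }
end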

CSV.generate and converters?

I'm trying to create a converter to remove newline characters from CSV output.
I've got:
nonewline = lambda do |s|
  s.gsub(/(\r?\n)+/, ' ')
end
I've verified that this works properly IF I load a variable and then run something like:
csv = CSV(variable, :converters => [nonewline])
However, I'm attempting to use this code to update a bunch of preexisting code using CSV.generate, and it does not appear to work at all.
CSV.generate(:converters => [nonewline]) do |csv|
  csv << ["hello\ngoodbye"]
end
returns:
"\"hello\ngoodbye\"\n"
I've tried quite a few things as well as trying other examples I've found online, and it appears as though :converters has no effect when used with CSV.generate.
Is this correct, or is there something I'm missing?
You need to write your converter as below:
CSV::Converters[:nonewline] = lambda do |s|
  s.gsub(/(\r?\n)+/, ' ')
end
Then do:
CSV.generate(:converters => [:nonewline]) do |csv|
  csv << ["hello\ngoodbye"]
end
Read the documentation on Converters.
Okay, I didn't remove the part above, so you can see how to write a custom CSV converter. But the way you wrote it is incorrect.
Read the documentation of CSV::generate
This method wraps a String you provide, or an empty default String, in a CSV object which is passed to the provided block. You can use the block to append CSV rows to the String and when the block exits, the final String will be returned.
After reading the docs, it is quite clear that this method is for writing CSV, not for reading it. The converter options (like :converters and :header_converters) are applied when you read a CSV file, but not when you write one.
Let me show you 2 examples to illustrate this more clearly.
require 'csv'

string = <<_
foo,bar
baz,quack
_
File.write('a', string)

CSV::Converters[:upcase] = lambda do |s|
  s.upcase
end
I am reading from a CSV file, so the :converters option is applied:
CSV.open('a', 'r', :converters => :upcase) do |csv|
  puts csv.read
end
output
# >> FOO
# >> BAR
# >> BAZ
# >> QUACK
Now I am writing to the CSV file, so the :converters option is not applied:
CSV.open('a', 'w', :converters => :upcase) do |csv|
  csv << ['dog', 'cat']
end

CSV.read('a') # => [["dog", "cat"]]
Attempting to remove newlines using :converters did not work.
I had to override the << method from csv.rb adding the following code to it:
# Change all CR/NLs into one space
row.map! { |element|
  if element.is_a?(String)
    element.gsub(/(\r?\n)+/, ' ')
  else
    element
  end
}
Placed right before
output = row.map(&@quote).join(@col_sep) + @row_sep # quote and separate
at line 21.
I would think this would be a good patch to CSV, as newlines will always produce bad CSV output.
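If patching csv.rb isn't appealing, a less invasive sketch of the same idea is to scrub each row yourself just before appending it (nothing here is specific to the CSV library's internals):

require 'csv'

strip_newlines = lambda do |field|
  field.is_a?(String) ? field.gsub(/(\r?\n)+/, ' ') : field
end

output = CSV.generate do |csv|
  [["hello\ngoodbye"], ["plain", "row"]].each do |row|
    csv << row.map(&strip_newlines)
  end
end

output # => "hello goodbye\nplain,row\n"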

Can I delete columns from CSV using Ruby?

Looking at the documentation for the CSV library of Ruby, I'm pretty sure this is possible and easy.
I simply need to delete the first three columns of a CSV file using Ruby, but I haven't had any success getting it to run.
csv_table = CSV.read(file_path_in, :headers => true)
csv_table.delete("header_name")
csv_table.to_csv # => The new CSV in string format
Check the CSV::Table documentation: http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV/Table.html
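For the original goal of dropping the first three columns rather than one named column, a sketch along the same lines might be:

require 'csv'

csv_table = CSV.read(file_path_in, :headers => true)

# Delete the first three columns by header name (CSV::Table#delete accepts headers).
csv_table.headers.first(3).each { |header| csv_table.delete(header) }

csv_table.to_csv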
csv_table = CSV.read("../path/to/file.csv", :headers => true)
keep = ["x", "y", "z"]
new_csv_table = csv_table.by_col!.delete_if do |column_name, column_values|
  !keep.include? column_name
end
new_csv_table.to_csv
What about:
require 'csv'

File.open("resfile.csv", "w+") do |f|
  CSV.foreach("file.csv") do |row|
    f.puts(row[3..-1].join(","))
  end
end
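Note that joining with a comma bypasses CSV quoting, so if any remaining field could contain a comma or a quote, a safer variant of the same idea is to let CSV write the rows:

require 'csv'

CSV.open("resfile.csv", "w") do |out|
  CSV.foreach("file.csv") do |row|
    out << row[3..-1] # drop the first three columns; CSV handles quoting
  end
end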
I have built on a few of the questions here (really liked what @fguillen did with CSV::Table) but just made it a bit simpler to drop into an existing project, target a file and make a quick change.
Have added byebug cause ... yes. Then also retained the headers from the original file (assuming they exist for anyone wanting to use this snippet).
The file is overwritten each time in case you want to test/tinker.
require 'csv'
require 'byebug'

in_file = './db/data/inbox/change__to_file_name.csv'
out_file = in_file + ".out"
target_col = "change_to_column_name"

csv_table = CSV.read(in_file, headers: true)
csv_table.delete(target_col)

CSV.open(out_file, 'w+', force_quotes: true) do |csv|
  csv << csv_table.headers
  csv_table.each do |row|
    csv << row
  end
end
