Ignoring multiple header lines in a CSV - ruby

I've worked a bit with Ruby's CSV module, but am having some problems getting it to ignore multiple header lines.
Specifically, here are the first twenty lines of a file I want to parse:
USGS Digital Spectral Library splib06a
Clark and others 2007, USGS, Data Series 231.
For further information on spectrsocopy, see: http://speclab.cr.usgs.gov
ASCII Spectral Data file contents:
line 15 title
line 16 history
line 17 to end: 3-columns of data:
wavelength reflectance standard deviation
(standard deviation of 0.000000 means not measured)
( -1.23e34 indicates a deleted number)
----------------------------------------------------
Olivine GDS70.a Fo89 165um W1R1Bb AREF
copy of splib05a r 5038
0.205100 -1.23e34 0.090781
0.213100 -1.23e34 0.018820
0.221100 -1.23e34 0.005416
0.229100 -1.23e34 0.002928
The actual headers are given on the tenth line, and the seventeenth line is where the actual data start.
Here's my code:
require "nyaplot"
# Note that DataFrame basically just inherits from Ruby's CSV module.
class SpectraHelper < Nyaplot::DataFrame
  class << self
    def from_csv filename
      df = super(filename, col_sep: ' ') do |csv|
        csv.convert do |field, info|
          STDERR.puts "Field is #{field}"
        end
      end
    end
  end
  def csv_headers
    [:wavelength, :reflectance, :standard_deviation]
  end
end
def read_asc filename
  f = File.open(filename, "r")
  16.times do
    line = f.gets
    puts "Ignoring #{line}"
  end
  d = SpectraHelper.from_csv(f)
end
The output suggests that my calls to f.gets are not actually ignoring those lines, and I can't understand why. Here are the first few lines of output:
Field is Clark
Field is and
Field is others
Field is 2007,
Field is USGS,
I tried looking for a tutorial or example which shows processing of more complicated CSV files, but haven't had much luck. If someone could point me towards a resource which answers this question, I would be grateful (and would prefer to mark that as accepted over a solution to my specific problem — but both would be appreciated).
Using Ruby 2.1.

I believe that you are using ::open, which uses IO.open. This method opens the file again, so the position reached by your f.gets calls is lost.
I modified the script a bit:
require 'csv'
class SpectraHelper < CSV
  def self.from_csv(filename)
    df = open(filename, 'r', col_sep: ' ') do |csv|
      csv.drop(16).each { |c| p c }
    end
  end
end
def read_asc(filename)
  SpectraHelper.from_csv(filename)
end
read_asc "data/csv1.csv"
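To illustrate the point about reopening, here is a minimal sketch that skips the preamble on one handle and then parses from that same handle, so the read position is kept. It uses plain CSV rather than Nyaplot; the sample data, the three-line preamble and the Tempfile are assumptions made only for this example.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample: three preamble lines followed by space-separated data.
sample = Tempfile.new("spectra")
sample.write("title line\nhistory line\n-----\n" \
             "0.2051 -1.23e34 0.090781\n0.2131 -1.23e34 0.018820\n")
sample.close

rows = File.open(sample.path, "r") do |f|
  3.times { f.gets } # consume the preamble on this handle
  # Wrap the same, already advanced handle; parsing starts at the current position.
  CSV.new(f, col_sep: " ", converters: :numeric).read
end
```

Here rows holds only the two data rows, because the handle passed to CSV.new is the one that was advanced, not a fresh one opened by file name.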

It turns out the problem here was not with my understanding of CSV, but rather with how Nyaplot::DataFrame handles CSV files.
Basically, Nyaplot doesn't actually store things as CSVs. CSV is just an intermediate format. So a simple way to handle the files makes use of @khelli's suggestion:
def read_asc filename
  Nyaplot::DataFrame.new(CSV.open(filename, 'r',
                                  col_sep: ' ',
                                  headers: [:wavelength, :reflectance, :standard_deviation],
                                  converters: :numeric).
    drop(16).
    map do |csv_row|
      csv_row.to_h.delete_if { |k,v| k.nil? }
    end)
end
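As far as I can tell, the delete_if step is there because a row with more fields than headers stores the extra fields under a nil key. The sketch below shows that with CSV alone (no Nyaplot); the sample text, with a trailing space on the last row, is made up for the illustration.

```ruby
require "csv"

# Hypothetical sample: two preamble lines, then data rows. The last data row
# has a trailing space, which CSV parses as one extra, empty field.
text = "title line\nhistory line\n" \
       "0.2051 -1.23e34 0.090781\n0.2131 -1.23e34 0.018820 \n"

table = CSV.parse(text,
                  col_sep: " ",
                  headers: [:wavelength, :reflectance, :standard_deviation],
                  converters: :numeric)

rows = table.drop(2).map do |csv_row|
  # Fields beyond the three named columns end up under a nil header; drop them.
  csv_row.to_h.delete_if { |key, _value| key.nil? }
end
```

Each element of rows is then a plain hash with exactly the three named keys.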
Thanks, everyone, for the suggestions.

I wouldn't use the CSV module, since your file is not well formatted. The following code reads the file and gives you an array of your records:
lines = File.readlines(filename)
lines.slice!(0, 16) # drop the 16 preamble lines
records = lines.map { |line| line.chomp.split }
The records output:
[["0.205100", "-1.23e34", "0.090781"], ["0.213100", "-1.23e34", "0.018820"], ["0.221100", "-1.23e34", "0.005416"], ["0.229100", "-1.23e34", "0.002928"]]
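The records above are still strings. A possible next step, sketched here under the assumption that you want floats and that the -1.23e34 marker from the file preamble should become nil, could look like this (parse_record and its deleted: parameter are names I made up):

```ruby
# Convert one whitespace-separated data line into floats. The value the file
# preamble describes as "a deleted number" is mapped to nil.
# Float() raises on malformed input instead of silently returning 0.0.
def parse_record(line, deleted: -1.23e34)
  line.split.map do |field|
    value = Float(field)
    value == deleted ? nil : value
  end
end

lines = [
  "0.205100 -1.23e34 0.090781\n",
  "0.213100 -1.23e34 0.018820\n"
]
records = lines.map { |line| parse_record(line) }
```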

Related

How to remove headers and second column in CSV in ruby?

I have a CSV that looks like this:
user_id,is_user_unsubscribed
131072,1
7077888,1
11010048,1
12386304,1
327936,1
2228480,1
6553856,1
9830656,1
10158336,1
10486016,1
10617088,1
11010304,1
11272448,1
393728,1
7012864,1
8782336,1
11338240,1
11928064,1
4326144,1
8127232,1
11862784,1
but I want the data to look like this:
131072
7077888
11010048
12386304
327936
...
any ideas on what to do? I have 330,000 rows...
You can read your file as an array and ignore the first row like this:
data = CSV.read("dataset.csv")[1 .. -1]
This way you can remove the header.
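Regarding the column, calling delete on that array will not remove it, because data is an array of rows. Read the file as a CSV::Table with headers: true and delete the column by name:
data = CSV.read("dataset.csv", headers: true)
data.delete("is_user_unsubscribed")
data.to_csv(write_headers: false) # => The new CSV in string format, without the header row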
Check this for more info: http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV/Table.html
http://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html
My recommendation would be to read in a line from your file as a string, then split the String that you get by commas (there is a comma separating your columns).
Splitting a Ruby String:
https://code-maven.com/ruby-split
line_num = 0
text = File.read('myfile.csv')
text.each_line do |line|
  text_array = line.chomp.split(',') # split on the comma that separates the columns
  text_i_want = text_array[0] # first column
  line_num = line_num + 1
  puts text_i_want
end
In this code we open a text file, and read line by line. Each line we split into the text we want by choosing the text from the first column (zeroth item in the array), then print it.
If you do not want the headers, skip the first line (the one read while line_num is still 0) with an if statement. Even better, use unless.
Just rewrite a new file with your new data.
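Putting those pieces together, a minimal sketch might look like the following. The file names and the sample rows are placeholders for the illustration, and it runs in a temporary directory so it does not touch real files.

```ruby
require "tmpdir"

# Hypothetical file names and sample rows, used only for this illustration.
result = Dir.mktmpdir do |dir|
  input  = File.join(dir, "myfile.csv")
  output = File.join(dir, "user_ids.txt")
  File.write(input, "user_id,is_user_unsubscribed\n131072,1\n7077888,1\n")

  File.open(output, "w") do |out|
    File.foreach(input).with_index do |line, line_num|
      # Skip the header (line 0); for every other line keep the first column.
      out.puts line.split(",")[0] unless line_num.zero?
    end
  end

  File.read(output)
end
```

File.foreach reads one line at a time, so the whole file never has to sit in memory at once.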
I wound up doing this. Is this kosher?
user_ids = []
CSV.foreach("eds_users_sept15.csv", headers: true) do |row|
  user_ids << row['user_id']
end
user_ids.count # => 322101
CSV.open('some_new_file.csv', 'w') do |c|
  user_ids.each do |id|
    c << [id]
  end
end
I have 330,000 rows...
So I guess speed matters, right?
I took your method and the other two that were proposed, tested them on a 330,000-row CSV file, and made a benchmark to show you something interesting.
require 'csv'
require 'benchmark'
Benchmark.bm(10) do |bm|
  bm.report("Method 1:") {
    data = Array.new
    CSV.foreach("input.csv", headers:true) do |row|
      data << row['user_id']
    end
  }
  bm.report("Method 2:") {
    data = CSV.read("input.csv")[1 .. -1]
    data.delete("is_user_unsubscribed") # no-op on an array of rows; kept as benchmarked
  }
  bm.report("Method 3:") {
    data = Array.new
    File.open('input.csv').read.each_line do |line|
      data << line.split(',')[0]
    end
    data.shift # => remove headers
  }
end
The output:
                 user     system      total        real
Method 1:    3.110000   0.010000   3.120000 (  3.129409)
Method 2:    1.990000   0.010000   2.000000 (  2.004016)
Method 3:    0.380000   0.010000   0.390000 (  0.383700)
As you can see, handling the CSV file as a simple text file, splitting the lines and pushing the first field into the array is roughly 5 times faster than CSV.read and roughly 8 times faster than CSV.foreach in this run. Of course it has some disadvantages too; i.e., if you ever add columns to the input file you'll have to review the code.
It's up to you if you prefer lightspeed code or easier scalability.
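One more disadvantage worth keeping in mind, shown with a made-up row: a plain split knows nothing about quoting, so a comma inside a quoted field is treated as a separator. That does not matter for the numeric file in the question, but it would for other data.

```ruby
require "csv"

# Hypothetical row whose first field contains a comma inside quotes.
line = %("Smith, Jane",42\n)

naive  = line.chomp.split(",") # plain text split: cuts inside the quoted field
parsed = CSV.parse_line(line)  # CSV-aware parse: respects the quotes
```

naive ends up with three pieces (the quoted name is cut in two), while parsed has the two intended fields.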
I'm guessing that you plan to convert each string that precedes a comma to an integer. If so, read the lines as plain strings:
File.readlines("dataset.csv").drop(1).map(&:to_i)
is all you need. (For example, "131072,1".to_i #=> 131072.)
If you want strings, you could write
File.readlines("dataset.csv").drop(1).map { |s| s[/\d+/] }

CSV.generate and converters?

I'm trying to create a converter to remove newline characters from CSV output.
I've got:
nonewline=lambda do |s|
s.gsub(/(\r?\n)+/,' ')
end
I've verified that this works properly IF I load a variable and then run something like:
csv=CSV(variable,:converters=>[nonewline])
However, I'm attempting to use this code to update a bunch of preexisting code using CSV.generate, and it does not appear to work at all.
CSV.generate(:converters=>[nonewline]) do |csv|
csv << ["hello\ngoodbye"]
end
returns:
"\"hello\ngoodbye\"\n"
I've tried quite a few things as well as trying other examples I've found online, and it appears as though :converters has no effect when used with CSV.generate.
Is this correct, or is there something I'm missing?
You need to write your converter as below:
CSV::Converters[:nonewline] = lambda do |s|
s.gsub(/(\r?\n)+/,' ')
end
Then do:
CSV.generate(:converters => [:nonewline]) do |csv|
csv << ["hello\ngoodbye"]
end
Read the documentation on Converters.
I have left the part above in place to show you how to write custom CSV converters. The way you wrote it is incorrect.
Read the documentation of CSV::generate
This method wraps a String you provide, or an empty default String, in a CSV object which is passed to the provided block. You can use the block to append CSV rows to the String and when the block exits, the final String will be returned.
After reading the docs, it is quite clear that this method is for writing CSV, not for reading it. All the converter options (like :converters and :header_converters) are applied when you are reading a CSV file, but not when you are writing to one.
Let me show you 2 examples to illustrate this more clearly.
require 'csv'
string = <<_
foo,bar
baz,quack
_
File.write('a',string)
CSV::Converters[:upcase] = lambda do |s|
s.upcase
end
I am reading from a CSV file, so the :converters option is applied to it.
CSV.open('a','r',:converters => :upcase) do |csv|
puts csv.read
end
output
# >> FOO
# >> BAR
# >> BAZ
# >> QUACK
Now I am writing to the CSV file, so the :converters option is not applied.
CSV.open('a','w',:converters => :upcase) do |csv|
csv << ['dog','cat']
end
CSV.read('a') # => [["dog", "cat"]]
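Since the converters are not run on output, one way around it is to clean each field yourself before appending the row. This is only a sketch of that idea; strip_newlines is a name I picked, and the regex is the one from the question.

```ruby
require "csv"

# Replace each run of newlines in a String field with one space;
# leave non-String fields untouched.
strip_newlines = lambda do |field|
  field.is_a?(String) ? field.gsub(/(\r?\n)+/, " ") : field
end

output = CSV.generate do |csv|
  # Clean the row explicitly, because :converters is not applied on write.
  csv << ["hello\ngoodbye", 42].map(&strip_newlines)
end
```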
Attempting to remove newlines using :converters did not work.
I had to override the << method from csv.rb adding the following code to it:
# Change all CR/NL's into one space
row.map! { |element|
  if element.is_a?(String)
    element.gsub(/(\r?\n)+/,' ')
  else
    element
  end
}
Placed right before
output = row.map(&@quote).join(@col_sep) + @row_sep # quote and separate
at line 21.
I would think this would be a good patch to CSV, as newlines will always produce bad CSV output.
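If editing csv.rb itself is not an option, a similar effect can probably be had from a small subclass. The sketch below is an assumption about how one might do it, not a tested patch: CleanCSV is a hypothetical name, and only Array rows are cleaned.

```ruby
require "csv"

# Hypothetical subclass; the name CleanCSV is not part of the CSV library.
class CleanCSV < CSV
  # Collapse newlines in String fields before the row is written.
  def <<(row)
    return super unless row.is_a?(Array) # leave CSV::Row and Hash rows alone
    super(row.map { |e| e.is_a?(String) ? e.gsub(/(\r?\n)+/, " ") : e })
  end
end

output = CleanCSV.generate do |csv|
  csv << ["hello\ngoodbye", "plain"]
end
```

Note that add_row and puts are aliases of the original <<, so rows appended through them would likely bypass this override.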

How to write columns header to a csv file with Ruby?

I am having trouble writing columns to a csv file with Ruby. Below is my snippet of code.
calc = numerator/denominator.to_f
data_out = "#{numerator}, #{denominator}, #{calc}"
File.open('cdhu3_X.csv','a+') do|hdr|
hdr << ["numerator","denominator","calculation\n"] #< column header
hdr << "#{data_out}\n"
end
The code adds the column headers to every line and I only need it at the top of each column of data. I have searched here and other places but can't find a clear answer to how its done.
Any help would be greatly appreciated.
I would recommend using the CSV library instead:
require 'csv'
CSV.open('test.csv', 'w',
         :write_headers => true,
         :headers => ["numerator","denominator","calculation"] #< column header
        ) do |hdr|
  1.upto(12) { |numerator|
    1.upto(12) { |denominator|
      data_out = [numerator, denominator, numerator/denominator.to_f]
      hdr << data_out
    }
  }
end
If you can't use the w option and you really need the a+ (e.g., the data isn't available all at once), then you could try the following trick:
require 'csv'
column_header = ["numerator","denominator","calculation"]
1.upto(12) { |numerator|
  1.upto(12) { |denominator|
    CSV.open('test.csv', 'a+',
             :write_headers => true,
             :headers => column_header
            ) do |hdr|
      column_header = nil # No header after first insertion
      data_out = [numerator, denominator, numerator/denominator.to_f]
      hdr << data_out
    end
  }
}
The cleanest way to do this is to open the file once, in mode 'w', write the headers, and then write the data.
If there's some technical reason that you can't do this (e.g., the data isn't available all at once), then you can use the IO#tell method on the file to return the current file position. When you open the file for appending, the position is set to the end of the file, so if the current file position is zero, then the file was newly created and has no headers:
File.open('cdhu3_X.csv', 'a+') do |hdr|
  if hdr.tell() == 0 # file is empty, so write header
    hdr << "numerator, denominator, calculation\n"
  end
  hdr << "#{data_out}\n"
end
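The position reported right after opening a file in append mode may differ between platforms, so checking the file size is an alternative worth considering. Here is a minimal sketch of that variant; append_row is a helper name I made up, and it runs in a temporary directory.

```ruby
require "csv"
require "tmpdir"

# Append one row, writing the header first only when the file is missing or empty.
def append_row(path, headers, row)
  needs_header = !File.exist?(path) || File.zero?(path)
  CSV.open(path, "a") do |csv|
    csv << headers if needs_header
    csv << row
  end
end

contents = Dir.mktmpdir do |dir|
  path = File.join(dir, "cdhu3_X.csv")
  headers = ["numerator", "denominator", "calculation"]
  append_row(path, headers, [1, 2, 1 / 2.to_f])
  append_row(path, headers, [3, 4, 3 / 4.to_f])
  File.read(path)
end
```

The header is written by the first call only; the second call sees a non-empty file and appends just its data row.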
The best way to handle a CSV file is to use Ruby's CSV module.
I had the same problem. After reading the CSV source code I came across this solution, which I find the most efficient:
headers = ['col1','col2','col3']
CSV.open(file_path, 'a+', force_quotes: true) do |csv|
  csv << headers if csv.count.eql? 0 # csv.count gives the number of rows in the file; if zero, insert headers
end
This works for me
headers = ["Reference Number", "Vendor Line Code"]
CSV.open(file_path, "wb") do |csv|
  csv << headers
  @vendor.vendor_items.each do |vi|
    row_data = [vi.reference_number, vi.line_code]
    csv << row_data
  end
end

Ruby: How to replace text in a file?

The following code is a line in an xml file:
<appId>455360226</appId>
How can I replace the number between the 2 tags with another number using ruby?
There is no way to modify a file's content in place in one step (at least none that I know of when the file size would change).
You have to read the file and store the modified text in another file.
replace="100"
infile = "xmlfile_in"
outfile = "xmlfile_out"
File.open(outfile, 'w') do |out|
  out << File.read(infile).gsub(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
end
Or you read the file content into memory and afterwards overwrite the file with the modified content:
replace="100"
filename = "xmlfile_in"
outdata = File.read(filename).gsub(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
File.open(filename, 'w') do |out|
  out << outdata
end
(Hope it works, the code is not tested)
You can do it in one line like this:
IO.write(filepath, File.open(filepath) {|f| f.read.gsub(/<appId>\d+<\/appId>/, "<appId>42</appId>")})
IO.write truncates the given file by default, so if you read the text first, perform the regex String#gsub and return the resulting string using File.open in block mode, it will replace the file's content in one fell swoop.
I like the way this reads, but it can be written in multiple lines too of course:
IO.write(filepath, File.open(filepath) do |f|
  f.read.gsub(/<appId>\d+<\/appId>/, "<appId>42</appId>")
end
)
replace="100"
File.open("xmlfile").each do |line|
  if line[/<appId>/ ]
    line.sub!(/<appId>\d+<\/appId>/, "<appId>#{replace}</appId>")
  end
  puts line
end
The right way is to use an XML parsing tool, an example of which is XmlSimple.
You did tag your question with regex. If you really must do it with a regex then
s = "Blah blah <appId>455360226</appId> blah"
s.sub(/<appId>\d+<\/appId>/, "<appId>42</appId>")
is an illustration of the kind of thing you can do but shouldn't.
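For comparison, here is roughly what the parser route could look like. I am using REXML here rather than XmlSimple, purely because it ships with Ruby (as a bundled gem in recent versions); the surrounding config element is invented for the example.

```ruby
require "rexml/document"

# Hypothetical document; only the appId element comes from the question.
xml = "<config><appId>455360226</appId></config>"

doc = REXML::Document.new(xml)
app_id = doc.elements["//appId"] # first appId element anywhere in the document
app_id.text = "42"               # replace the element's text content

updated = doc.to_s
```

Unlike the regex, this keeps working if the element gains attributes or the number is surrounded by whitespace.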

trying to get the delta between columns using FasterCSV

A bit of a noob here so apologies in advance.
I am trying to read a CSV file which has a number of columns. I would like to see if the string "foo" exists anywhere in the file and, if so, grab the string one cell over (i.e. same row, one column over) and then write that to a file.
my file c.csv:
foo,bar,yip
12,apple,yap
23,orange,yop
foo,tom,yum
so in this case, I would want "bar" and "tom" in a new csv file.
Here's what I have so far:
#!/usr/local/bin/ruby -w
require 'rubygems'
require 'fastercsv'
rows = FasterCSV.read("c.csv")
acolumn = rows.collect{|row| row[0]}
if acolumn.select{|v| v =~ /foo/} == 1
i = 0
for z in i..(acolumn).count
puts rows[1][i]
end
I've looked here https://github.com/circle/fastercsv/blob/master/examples/csv_table.rb but I am obviously not understanding it, my best guess is that I'd have to use Table to do what I want to do but after banging my head up against the wall for a bit, I decided to ask for advice from the experienced folks. help please?
Given your input file c.csv
foo,bar,yip
12,apple,yap
23,orange,yop
foo,tom,yum
then this script:
#!/usr/bin/ruby1.8
require 'fastercsv'
FasterCSV.open('output.csv', 'w') do |output|
  FasterCSV.foreach('c.csv') do |row|
    foo_index = row.index('foo')
    if foo_index
      value_to_the_right_of_foo = row[foo_index + 1]
      output << [value_to_the_right_of_foo] # a row is an array of fields
    end
  end
end
will create the file output.csv
bar
tom
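FasterCSV became the standard csv library in Ruby 1.9, so on a newer Ruby the same idea can be written without the gem. This is a sketch that works on in-memory strings instead of files so it can be run as is; the guard for a missing right-hand neighbour is my addition.

```ruby
require "csv"

input = "foo,bar,yip\n12,apple,yap\n23,orange,yop\nfoo,tom,yum\n"

output = CSV.generate do |out|
  CSV.parse(input) do |row|
    foo_index = row.index("foo")
    next unless foo_index               # no "foo" in this row
    neighbour = row[foo_index + 1]
    out << [neighbour] if neighbour     # skip when "foo" is the last cell
  end
end
```

For real files, CSV.foreach('c.csv') and CSV.open('output.csv', 'w') take the place of CSV.parse and CSV.generate.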
