Why do I have a trailing column when reading a CSV file?

I have a CSV file with the following structure:
"customer_id";"customer_name";"quantity";
"id1234";"Henry";"15";
Parsing with Ruby's standard CSV lib:
csv_data = CSV.read(pathtofile, {
  :headers    => :first_row,
  :col_sep    => ";",
  :quote_char => '"',
  :row_sep    => "\r\n" # setting it to "\r" or "\n" results in MalformedCSVError
})
puts csv_data.headers.count # => 4
I don't understand why the parsing seems to result in four columns although the file only contains three. Is this not the right approach to parse the file?

The ; at the end of each row implies another field, even though there is no value.
I would either remove the trailing ;s or just ignore the fourth field when it is parsed.
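For instance, a minimal sketch of the preprocessing route, assuming the file layout shown in the question:
require 'csv'

# Strip the trailing ";" from every line, then parse the cleaned text.
cleaned = File.readlines(pathtofile)
              .map { |line| line.chomp.chomp(";") }
              .join("\n")
csv_data = CSV.parse(cleaned, headers: :first_row, col_sep: ";")
csv_data.headers.count # => 3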

The trailing ; is the culprit.
You can preprocess the file, stripping the trailing ;, but that incurs unnecessary overhead.
You can post-process the array of rows returned by CSV (when no :headers option is given) using something like this:
csv_data = CSV.read(...).each(&:pop)
That will iterate over the sub-arrays, removing the last element from each in place. (Note that map(&:pop) would instead return only the popped values.) The problem is that read isn't scalable, since it loads the whole file into memory, so you might want to rethink using it and instead use CSV.foreach to read the file line by line, popping the last value from each row as it's returned to you.
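A minimal sketch of the foreach route, reusing pathtofile and the options from the question:
require 'csv'

CSV.foreach(pathtofile, headers: :first_row, col_sep: ";") do |row|
  fields = row.fields[0...-1] # drop the empty trailing field
  # process fields here...
end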

Deleting an unwanted character in a string

I am parsing a CSV file and converting each element to UTF-8:
CSV.foreach(@data_source, col_sep: ';', quote_char: "\x00", encoding: "CP850") do |row|
  row.map! { |x| x.force_encoding('UTF-8') unless x.nil?; x.scrub!("") unless x.nil?; x.delete('\u2FEC') unless x.nil? }
end
The script then does a bunch of calculations and then saves the data in xlsx format using axlsx gem.
I added x.delete('\u2FEC') unless x.nil? because I found that the source file contained this strange sequence, which later causes an "Unreadable content" error in Excel.
I found that it solves the "Unreadable content" issue, but it not only deletes the "\u2FEC" sequence, it deletes every occurrence of the character "2" too.
Do you have any idea how I can get rid of only "\u2FEC" and not every "2" in my rows?
Thanks.
Single-quoted strings don't support Unicode escapes. (In fact, they don't support any escapes other than \' and \\.) So '\u2FEC' is just the six characters \, u, 2, F, E, C, and since String#delete treats its argument as a set of characters, every one of those characters is removed, including every "2".
You need to use either a double-quoted string or enter the character directly into the string instead of a Unicode escape sequence.
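A quick illustration of the difference:
'\u2FEC'.length    # => 6 (the literal characters \ u 2 F E C)
"\u2FEC".length    # => 1 (the actual U+2FEC character)

s = "a2b\u2FECc"
s.delete('\u2FEC') # => "ab\u2FECc" (the "2" is gone, U+2FEC survives)
s.delete("\u2FEC") # => "a2bc"      (only U+2FEC is removed)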

Skip rows before the header in CSV file [duplicate]

I tried searching but couldn't find a question regarding my problem. Let's say I have a CSV file that looks something like this:
Metadata line 1
Metadata line 2
Metadata line 3
Metadata line 4
foo,bar,baz
apple,orange,banana
cashew,almond,walnut
The line foo,bar,baz is the header, and the following lines are the corresponding data. When I write my ruby code like so:
CSV.foreach("filename.csv",:headers=>true) do |row|
puts "#{row}"
end
It clearly breaks. What's the best way to skip the lines before the header? Currently I'm thinking I could do something like:
1. Find the first row with commas and get its line number
2. Extract that line as an array
3. Pass that array to :headers
But this feels cumbersome. If I know exactly what line the header is on, what's the best way to jump to that line and ignore everything before it? Is this possible? If this is a question that has been asked before, I will happily devour those answers; perhaps my search-fu just isn't good enough.
Thank you so much!
There is a skip_lines option to CSV. It's not exactly clear whether it will skip only header lines or data rows too, but it's worth a shot.
:skip_lines - When set to an object responding to match, every line matching it is considered a comment and ignored during parsing. When set to a String, it is first converted to a Regexp. When set to nil, no line is considered a comment. If the passed object does not respond to match, ArgumentError is thrown.
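For example, a sketch assuming the metadata lines share a recognizable prefix (they all start with "Metadata" in the sample above):
require 'csv'

CSV.foreach("filename.csv", headers: true, skip_lines: /\AMetadata/) do |row|
  p row
end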
If you know how many metadata lines there are, you can just eat them before creating the CSV object.
You could of course also do something useful with them, but that's up to you!
require 'csv'

3.times { DATA.readline }
csv = CSV.new(DATA, headers: true, return_headers: false)
csv.read.each do |row|
  p row
end
# => #<CSV::Row "header1":"1" "header2":"2">
# => #<CSV::Row "header1":"3" "header2":"4">
# => #<CSV::Row "header1":"5" "header2":"6">
p csv.headers
# => ["header1", "header2"]

__END__
# I know
# there are 3 lines
# here, so I can skip them.
header1,header2
1,2
3,4
5,6
You can do something like:
require 'csv'

# read and discard lines until one contains at least two commas
while (header = DATA.readline) !~ /,.*,/
end

csv = CSV.new(DATA.read, headers: header)
csv.each do |row|
  p row
end
p csv.headers

__END__
Metadata line 1
Metadata line 2
Metadata line 3
Metadata line 4
foo,bar,baz
apple,orange,banana
cashew,almond,walnut
One warning: Nick's third DATA line (# here, so I can skip them.) contains only one comma, so the rule "find the first row with commas" could lead to a mismatch. You can use the regex /,.*,/, but then the header must contain at least two commas to be detected.
In other words: it is essential to have at most one comma in any line before the header and more than one comma in the real header line.
Remark 2: DATA is a special Ruby construct that can be replaced with a file handle (e.g. the f in File.open(filename) { |f| ... }).
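A sketch of the same readline-skipping approach with a real file (assuming the four metadata lines from the question):
require 'csv'

File.open("filename.csv") do |f|
  4.times { f.readline } # eat the metadata lines
  CSV.new(f, headers: true).each do |row|
    p row
  end
end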

Ruby CSV.open need to remove quotes and null characters

I am retrieving a large hash of results from a database query and writing them to a CSV file. The code block below takes the results and creates the CSV. With the quote_char: option it will replace the quotes with NULL characters, which I need in order to create the tab-delimited file properly.
However, the NULL characters are getting converted into "" when they are loaded into their destination, so I would like to remove them. If I leave out quote_char:, every field is double-quoted, which causes the same result.
How can I remove the NULL characters?
begin
  CSV.open("#{file_path}" 'file.tab', "wb", Options = { col_sep: "\t", quote_char: "\0" }) do |csv|
    csv << ["Key", "channel"]
    series_1_results.each_hash do |series_1|
      csv << ["#{series_1['key']}", "#{series_1['channel']}"]
    end
  end
end
As stated in the CSV documentation, you have to set quote_char to some character, and this character will always be used to quote empty fields.
It seems the only solution in this case is to remove the used quote_chars from the created CSV file. You can do it like this:
quoted_file   = File.read("#{file_path}" 'file.tab')
unquoted_file = quoted_file.gsub("\0", "")
File.open("#{file_path}" 'unquoted_file.tab', "w") { |file| file.puts unquoted_file }
I assume here that NULLs are the only quoted fields. If that's not the case, use the default quote_char: '"' and gsub(',"",', '') which should handle almost all possible cases of fields containing special characters.
But as you note that the results of your query are large, it might be more practical to prepare the file yourself and avoid processing the output twice. You could simply write the tab-separated lines directly:
File.open("#{file_path}" 'unquoted_file.tab', "w") do |file|
  file.puts ["Key", "channel"].join("\t")
  series_1_results.each_hash do |series_1|
    file.puts [series_1['key'], series_1['channel']].join("\t")
  end
end
Once more, you might need to handle fields with special characters.
From the Ruby CSV docs, setting force_quotes: false in the options seems to work.
CSV.open("#{file_path}"'file.tab', "wb", { col_sep: "\t", force_quotes: false }) do |csv|
The above does the trick. I'd suggest against setting quote_char to \0, since that doesn't work as expected.
There is one thing to note, though. If a field is an empty string (""), it will force the quote_char to be printed into the CSV, but strangely a nil value does not. I'd suggest that if you're expecting empty strings in the data at all, you somehow convert them into nil when writing to the CSV (maybe using the ActiveSupport presence method or anything similar).
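A small sketch of that empty-string caveat:
require 'csv'

CSV.generate(col_sep: "\t") { |csv| csv << ["a", "", "b"] }
# => "a\t\"\"\tb\n" (the empty string still comes out quoted)
CSV.generate(col_sep: "\t") { |csv| csv << ["a", nil, "b"] }
# => "a\t\tb\n"     (nil produces a truly empty field)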
First, a tab-separated file is "TSV", vs. a comma-separated file which is "CSV".
Wrapping quotes around fields is necessary any time there could be an occurrence of the field delimiter inside a field.
For instance, how are you going to embed this string in a tab-delimited file?
Foo\tbar
The \t is the representation of an embedded Tab.
The same problem occurs when writing a CSV file with a field containing commas. The field has to be wrapped in double-quotes to delimit the field itself.
If your input contains any data that needs to be escaped (such as the column separator or the quote character), then you do need to quote your data. Otherwise it cannot be parsed correctly later.
CSV.open('test.csv', 'wb', col_sep: "\t") do |csv|
  csv << ["test", "'test'", '"test"', nil, "test\ttest"]
end
puts open('test.csv').read
#test 'test' """test""" "test test"
The CSV class won't quote anything unnecessarily (as you can see above), so I'm not sure why you're saying all your fields are being quoted. It could be that force_quotes is somehow getting set to true somewhere.
If you're absolutely certain your data will never contain \t or ", then the default quote_char (") should work just fine. Otherwise, if you want to avoid quoting anything, you'll need to pick another quote character that you're absolutely certain won't occur in your data.
CSV.open('test.csv', 'wb', col_sep: "\t", quote_char: "|") do |csv|
  csv << ["test", "'test'", nil, '"test"']
end
puts open('test.csv').read
#test 'test' "test"

How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?

The incoming data file(s) contain malformed CSV data such as non-escaped quotes, as well as (valid) CSV data such as fields containing new lines. If a CSV format error is detected I would like to use an alternative routine on that data.
With the following sample code (abbreviated for simplicity)
FasterCSV.open(file) { |csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...
    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
}
The MalformedCSVError is caused in the csv.shift method. How can I access the data that caused the error from the rescue clause?
require 'csv' # CSV in Ruby 1.9.2 is identical to FasterCSV

# File.open('test.txt', 'r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row # handle row
    end
  rescue CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end

# Output:
# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]

__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid
Just feed the file line by line to FasterCSV and rescue the error.
This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Something like this:
class MyFile < File
  attr_accessor :last_gets

  def gets(*args)
    line = super
    (@last_gets ||= '') << $/ << line.to_s
    line
  end
end
# then...
file = MyFile.open(filename, 'r')
csv = FasterCSV.new(file)
row = true
while row
  begin
    break unless row = csv.shift
    # do things with the good row here...
  rescue FasterCSV::MalformedCSVError => e
    bad_row = file.last_gets
    # do something with bad_row here...
    next
  ensure
    file.last_gets = '' # nuke the @last_gets "buffer"
  end
end
Kinda neat, right? BUT! there are caveats, of course:
I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.
This fails if your CSV file contains newline characters inside quoted fields. The reason is described in the source: basically, if a quoted value contains a newline, shift has to make additional gets calls to read the entire row. There could be a clever way around this limitation, but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields, this shouldn't be a worry for you.
Your other option would be to read the file with gets and pass each line in turn to FasterCSV.parse_line, but I'm pretty sure that in so doing you'd squander any performance advantage gained from using FasterCSV.
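A minimal sketch of that fallback; slower, but the offending line is trivially available in the rescue clause (and it shares the same limitation with newlines inside quoted fields):
require 'fastercsv'

File.foreach(filename) do |line|
  begin
    row = FasterCSV.parse_line(line)
    # handle the good row here...
  rescue FasterCSV::MalformedCSVError
    # handle the bad line here; `line` holds the raw text
  end
end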
I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes, instead of the "" that CSV expects. Hence,
class MyFile < File
  def gets(*args)
    line = super
    unless line.nil?
      line.gsub!('\\"', '""') # fix the \" that would otherwise cause a parse error
    end
    line
  end
end

infile = MyFile.open(filename)
incsv = CSV.new(infile)
while row = incsv.shift
  # process each row here
end
This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.

What's the best way to parse a tab-delimited file in Ruby?

What's the best (most efficient) way to parse a tab-delimited file in Ruby?
The Ruby CSV library lets you specify the field delimiter. (Ruby 1.9's standard CSV library is FasterCSV.) Something like this would work:
require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
The rules for TSV are actually a bit different from CSV. The main difference is that CSV allows a comma to appear inside a field by wrapping the field in quotation characters and escaping the quotes that appear inside it. I wrote a quick example to show how the simple approach fails:
require 'csv'

line = "boogie\ttime\tis \"now\""
begin
  line = CSV.parse_line(line, col_sep: "\t")
  puts "parsed correctly"
rescue CSV::MalformedCSVError
  puts "failed to parse line"
end

begin
  line = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
  puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
  puts "failed to parse line with random quote char"
end

# Output:
# failed to parse line
# parsed correctly with random quote char
If you want to use the CSV library, you could use a random quote character that you don't expect to see in your file (the example shows this), but you could also use a simpler methodology, like the StrictTsv class shown below, to get the same effect without having to worry about field quotations.
# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
  attr_reader :filepath

  def initialize(filepath)
    @filepath = filepath
  end

  def parse
    open(filepath) do |f|
      headers = f.gets.strip.split("\t")
      f.each do |line|
        fields = Hash[headers.zip(line.split("\t"))]
        yield fields
      end
    end
  end
end
# Example Usage
tsv = StrictTsv.new("your_file.tsv")
tsv.parse do |row|
  puts row['named field']
end
The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values
There are actually two different kinds of TSV files.
1. TSV files that are actually CSV files with the delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t")
2. TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line by line with a simple line.chomp.split("\t", -1) (note the -1, which prevents split from removing empty trailing fields; chomp, unlike rstrip, won't eat trailing tabs). If you want to use the csv gem, simply set quote_char to nil:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
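A sketch of that line-by-line route for IANA-style files:
File.foreach("file.tsv") do |line|
  fields = line.chomp.split("\t", -1) # chomp removes only the line terminator, so empty trailing fields survive
  # handle fields...
end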
I like mmmries' answer. However, I hate the way Ruby strips empty values off the end of a split, and that the newline at the end of each line isn't stripped either.
Also, I had a file with potential newlines within a field, so I rewrote the parse method as follows:
def parse
  open(filepath) do |f|
    headers = f.gets.strip.split("\t")
    f.each do |line|
      my_line = line
      # keep appending lines until we have a full record's worth of tabs
      while my_line.scan(/\t/).count != headers.count - 1
        my_line += f.gets
      end
      fields = Hash[headers.zip(my_line.chomp.split("\t", headers.count))]
      yield fields
    end
  end
end
This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).
