I am using ruby CSV.read with massive data. From time to time the library encounters poorly formatted lines, for instance:
"Illegal quoting in line 53657."
It would be easier to skip the offending line than to go through each CSV and fix the formatting. How can I do this?
I had this problem in a line like 123,456,a"b"c
The problem is that the CSV parser expects any " characters that appear to entirely surround a comma-delimited field.
Solution: use a quote character other than " that I was sure would not appear in my data:
CSV.read(filename, :quote_char => "|")
The liberal_parsing option is available starting in Ruby 2.4 for cases like this. From the documentation:
When set to a true value, CSV will attempt to parse input not conformant with RFC 4180, such as double quotes in unquoted fields.
To enable it, pass it as an option to the CSV read/parse/new methods:
CSV.read(filename, liberal_parsing: true)
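As a quick check of the option described above, here is a minimal sketch that uses the sample line from the earlier answer (123,456,a"b"c). It assumes Ruby 2.4 or later; the variable names are only illustrative.

```ruby
require 'csv'

line = '123,456,a"b"c'

# Strict parsing rejects the stray quotes inside an unquoted field.
strict_error =
  begin
    CSV.parse_line(line)
    nil
  rescue CSV::MalformedCSVError => e
    e
  end

# With liberal_parsing the quotes are kept as literal characters.
fields = CSV.parse_line(line, liberal_parsing: true)

puts "strict parsing failed: #{strict_error.message}" if strict_error
p fields
```

Note that liberal parsing changes how every row is read, not only the broken ones, so it is worth spot-checking a few rows of the result.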
Don't let CSV both read and parse the file.
Just read the file yourself and hand each line to CSV.parse_line, and then rescue any exceptions it throws.
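A minimal sketch of that idea follows. The helper name parse_skipping_bad_lines and the sample data are hypothetical, and the approach assumes that no quoted field spans more than one physical line (such fields would be split and rejected).

```ruby
require 'csv'
require 'stringio'

# Hypothetical helper: parse each physical line on its own and
# collect the lines that CSV refuses to parse instead of aborting.
def parse_skipping_bad_lines(io)
  rows = []
  skipped = []
  io.each_line.with_index(1) do |line, lineno|
    begin
      row = CSV.parse_line(line)
      rows << row unless row.nil?
    rescue CSV::MalformedCSVError => e
      skipped << [lineno, e.message]
    end
  end
  [rows, skipped]
end

# Sample input: the second line has illegal quoting.
data = StringIO.new("1,2,3\n123,456,a\"b\"c\n4,5,6\n")
rows, skipped = parse_skipping_bad_lines(data)

p rows
skipped.each { |lineno, message| warn "skipped line #{lineno}: #{message}" }
```

For a real file, File.open(filename) { |f| parse_skipping_bad_lines(f) } would take the place of the StringIO.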
Try setting the quote character to something that cannot appear in the data, such as a NUL byte:
require 'csv'
CSV.foreach(file, headers: :first_row, quote_char: "\x00") do |line|
  p line
end
Apparently this error can also be caused by unprintable BOM characters. This thread suggests using a file mode to force a conversion, which is what finally worked for me.
require 'csv'
CSV.open(@filename, 'r:bom|utf-8') do |csv|
  # do something
end
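To see the effect of the 'r:bom|utf-8' mode, here is a small sketch that writes a hypothetical sample file starting with a UTF-8 byte order mark and reads it back. The file name and contents are made up for illustration.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample file: a UTF-8 BOM followed by quoted headers.
file = Tempfile.new(['bom_sample', '.csv'])
file.close
File.binwrite(file.path, "\xEF\xBB\xBF".b + "\"id\",\"name\"\n1,x\n".b)

# The bom| prefix tells Ruby to consume the BOM before CSV sees the data.
rows = CSV.open(file.path, 'r:bom|utf-8') { |csv| csv.read }
p rows.first

file.unlink
```

Without the bom| prefix, the BOM bytes sit in front of the opening quote of the first field, which is one way to end up with an "Illegal quoting in line 1" error.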
When processing a file, I used to use the special variable $. to get the last line number being read. For instance, the following program
require 'csv'
IFS=';'
CSV_OPTIONS = { col_sep: IFS, external_encoding: Encoding::ISO_8859_1, internal_encoding: Encoding::UTF_8 }
CSV.new($stdin, **CSV_OPTIONS).each do |row|
  puts "::::line #{$.} row=#{row}"
end
is supposed to dump a CSV file (where the fields are delimited by semicolon instead of comma, as is the case in our project) and prepend each output line by the line number.
After updating Ruby to
_ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-cygwin]_
the lines are still dumped, but the line number is always displayed as zero.
What strikes me is that this Ruby Wiki on special Ruby variables, while still having $. in its list, no longer has a description for this variable. So I wonder: is this variable gone, or was it never supposed to work with the CSV class and just worked for me by accident in the earlier versions?
I'm not sure why $. isn't working for you, but it's also not the best solution here. When it works, $. gives you the number of lines read from input, but since quoted fields in a CSV file can span multiple lines the number you get from $. won't always be the number of rows that have been read.
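The difference between physical lines and CSV rows can be seen with a small, made-up sample in which one quoted field contains a newline:

```ruby
require 'csv'

# Hypothetical data: the second row has a quoted field spanning two lines.
data = "id,note\n1,\"first line\nsecond line\"\n"

physical_lines = data.lines.count  # what a line-based counter would see
rows = CSV.parse(data)             # what the CSV parser returns

puts "physical lines: #{physical_lines}"
puts "csv rows:       #{rows.length}"
```

Any counter based on lines read from the input will run ahead of the row count as soon as such a field appears.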
As mentioned above, each_with_index is a good alternative:
CSV.new($stdin, **CSV_OPTIONS).each_with_index do |row, i|
  puts "::::row #{i} row=#{row}"
end
Another alternative is CSV#lineno:
lineno()
The line number of the last row read from this file. Fields with nested line-end characters will not affect this count.
You would use it like this:
csv = CSV.new($stdin, **CSV_OPTIONS)
csv.each do |row|
  puts "::::row #{csv.lineno} row=#{row}"
end
Note that each_with_index will start counting at 0, whereas lineno starts at 1.
You can see both approaches in action on repl.it: https://repl.it/@jrunning/LoudBlushingCharactercode
I have a CSV file. I checked its encoding using this:
File.open('C:\path\to\file\myfile.txt').read.encoding
and it returned the encoding as:
=> #<Encoding:IBM437>
I'm reading this CSV per row -- stripping spaces and doing other stuff. After "cleansing" it, I push it to a new file. I'm doing it like this:
CSV.foreach(file_read, encoding: "IBM437:UTF-8") do |r|
  # some code
  CSV.open(file_appended, "a", col_sep: "|") do |csv|
    csv << r
  end
end
Now my problem is, inside the CSV I'm reading, there's a word with an accented character -- Ñ to be exact. This character is being appended to the new file as
\u2564
It's a problem, considering that the accented character is a vital part of that word, and I want that character to appear in the new file as-is.
Am I missing something? I tried the following source:destination encodings, but to no avail:
ISO-8859-1:UTF8 (and vice versa)
ISO-8859-1:Windows-1252 (and vice versa)
Here is my ruby version, just if you'd need to know:
ruby 1.9.3p392 (2013-02-22) [i386-mingw32]
Thanks in advance!
The line below solved my problem:
Encoding.default_external = "iso-8859-1"
It tells Ruby to treat files it reads as ISO-8859-1 by default, so the byte for Ñ is interpreted correctly.
Credit goes to Darshan Computing's answer here. Just look for Update #2.
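The following sketch shows why the declared source encoding matters here. It assumes the Ñ is stored as the single byte 0xD1 (its ISO-8859-1 value) and that a Ruby with String#b (2.0 or later) is available; the variable names are illustrative only.

```ruby
byte = "\xD1".b  # the single byte that represents Ñ in ISO-8859-1

# The same byte decoded under the two encodings discussed above.
as_ibm437 = byte.dup.force_encoding('IBM437').encode('UTF-8')
as_latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts format('IBM437     -> U+%04X', as_ibm437.ord)
puts format('ISO-8859-1 -> U+%04X', as_latin1.ord)
```

Read as IBM437, the byte becomes the box-drawing character U+2564 from the question. Read as ISO-8859-1, it becomes Ñ (U+00D1). If changing the global default is undesirable, passing encoding: "ISO-8859-1:UTF-8" to CSV.foreach should have a similar effect for that one call.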
My computer has no idea what this character is. It came from Excel.
In Excel it was a weird space; now it is literally represented by several symbols, i.e. my computer has no idea what it is.
This character is represented by a Ê in Excel (in the CSV; in the XLS it is a space of some kind). OS X's TextEdit treats it as a big space this long " ", which is, I think, what it is. Ruby's CSV parser blows up when it tries to parse it as normal UTF-8, so I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns it into a "K". This K appears in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc.) in my CSV and cannot be removed via gsub(/K/) (groups of K, such as /KKKKKKKKK/, cannot be removed either)! I've also tried the open-source tool CSVfix, but its "removing leading and trailing spaces" command had no effect on the Ks.
I've tried using sed as suggested in Remove non-ascii characters from csv, but got errors like
sed: 1: "output.csv": invalid command code o
when running something like sed -i 's/[\d128-\d255]//' input.csv on Mac.
Parse your csv with the following to remove your "evil" character
.encode!("ISO-8859-1", :invalid => :replace)
**self-answers (different account, same person)
1st solution attempt:
evil_string_from_csv_cell = "KKKKKKKKK"
encoding_opts = {
  :invalid => :replace, :undef => :replace,
  :replace => '', :universal_newline => true
}
evil_string_from_csv_cell.encode(Encoding.find('ASCII'), **encoding_opts)
#=> ""
2nd solution attempt:
Don't use 'windows-1251:utf-8' for encoding, use 'iso-8859-1' instead, which will turn those (cyrillic) K's into '\xCA', which can then be removed with
string.gsub!(/\xCA/, '')
** I have not solved this problem yet.
3rd solution attempt:
Trying to match a run of these K's as if they were actual Latin K's is foolish. Copy and paste in the actual Cyrillic K and see how that works. Here is the character; notice the little curl on the end:
К
Ruby displays it a little bit bolder than normal K's.
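A small sketch of what is going on, assuming (as the second attempt suggests) that the stray character is the single byte 0xCA. The cell contents below are hypothetical.

```ruby
# Hypothetical raw cell bytes containing three stray 0xCA bytes.
raw = "Metatag\xCA\xCA\xCAA".b

# Decoded as Windows-1251, 0xCA is the Cyrillic capital Ka (U+041A),
# which looks like a Latin K but is a different character.
as_cyrillic = raw.dup.force_encoding('Windows-1251').encode('UTF-8')

# Decoded as ISO-8859-1, the same byte is Ê (U+00CA).
as_latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Matching the real code point removes the characters.
cleaned = as_cyrillic.gsub("\u041A", '')

p as_cyrillic.include?('K')  # Latin K never appears in the decoded text
p cleaned
```

This is why gsub(/K/) with a Latin K has no effect: the pattern has to use the Cyrillic character itself, either pasted in or written as "\u041A".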
4th solution/strategy attempt (success):
use regular expressions to capture the characters, so long as you can encode the weird spaces (or whatever they are) into something, you can then ignore them using regular expressions
also try to take advantage of any spatial (matrix-like) patterns amongst the document types.
The answer to this problem is
A.) This is a very difficult problem. No one so far knows how to "physically" remove the Cyrillic Ks.
but
B.) CSV files are just strings separated by unescaped commas, so matching strings using regular expressions works just fine so long as the encoding doesn't break the program.
So to read the file
f = File.open(File.join(Rails.root, 'lib', 'assets', 'repo', name), :encoding => "windows-1251:utf-8")
parsed = CSV.parse(f)
then find specific rows via regular expression literal string matching (it will overlook the cyrillic K's)
parsed.each do |p| # here, p[0] is the metatag column
  @specific_metatag_row = parsed.index(p) if p[0] =~ /MetatagA/
end
I couldn't get sed working but finally had luck doing this in Vim:
vim myhorriblefile.csv
# Once vim is open:
:%s/Ê/ /g
:wq
# Done!
As a generalized function for reuse, this can be:
clean_weird_character () {
  vim "$1" -c ":%s/Ê/ /g" -c "wq"
}
I frequently deal with UTF-16LE files encoded on windows which have a \r\n carriage return. There is no problem converting the file to UTF-8 by using:
File.new(filepath, 'r:utf-16le:utf-8')
But this of course does not get rid of the \r. The way I currently get rid of them is with
str.gsub("\r", "")
But it would be nice to take care of it while reading the file in. String#encode has :cr_newline, :crlf_newline, and :universal_newline options which convert all newlines to a desired kind of newline. Is there a way to apply these or similar options while reading in a file?
The method IO#gets takes an optional argument that allows you to pass a string to define how to separate the lines:
file = File.new(filepath, 'r:utf-16le:utf-8')
while (line = file.gets("\r\n"))
...
end
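Note that gets keeps the separator at the end of each returned line, so it still needs to be removed. A minimal sketch with a made-up UTF-16LE sample file (the 'rb' prefix is an assumption on my part to keep the mode portable):

```ruby
require 'tempfile'

# Hypothetical sample: two lines of UTF-16LE text with Windows line endings.
file = Tempfile.new('utf16_sample')
file.close
File.binwrite(file.path, "first\r\nsecond\r\n".encode('UTF-16LE'))

lines = []
File.open(file.path, 'rb:utf-16le:utf-8') do |f|
  while (line = f.gets("\r\n"))
    lines << line.chomp("\r\n")  # gets keeps the separator, so strip it
  end
end

p lines
file.unlink
```

Each line arrives already transcoded to UTF-8, and the chomp removes the whole "\r\n" pair rather than only the "\n".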
I'm using Ruby's CSV library to parse some CSV. I have a seemingly well-formed CSV file that I created by exporting an Excel file as CSV.
However CSV.open(filename, 'r') causes a CSV::IllegalFormatError.
There are no rogue commas or quotation marks in the file, nor anything else that I can see that might cause problems.
I suspect the problem could be to do with line endings. I am able to parse data entered manually via a text editor (Aquamacs). It is just when I try with data exported from Excel (for OS X) that problems occur. When I open up the exported CSV in vim, all the text appears on one line, with ^M appearing between lines.
From the docs, it seems that you can provide open with a row separator; however I am unsure what it should be in this case.
Try: CSV.open('filename', 'r', ?,, ?\r)
As cantlin notes, for Ruby 2 it's:
CSV.open('file.csv', 'r', :col_sep => ?,, :row_sep => ?\r)
I'm pretty sure these will do the right thing for you. You can also "fix" the file itself (in which case keep the old open call) with the following vim command: :%s/\r/\r/g
Yes, I know that command looks like a total no-op, but it will work.
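For illustration, here is a short sketch of the row_sep option applied to an in-memory string with bare carriage returns as line endings. The data is made up.

```ruby
require 'csv'

# Hypothetical data with classic Mac line endings (a bare carriage return).
data = "name,qty\rapple,3\rpear,5\r"

rows = CSV.parse(data, col_sep: ',', row_sep: "\r")
p rows
```

The same col_sep and row_sep options can be passed to CSV.open, CSV.read, and CSV.foreach when working with a file.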
Stripping \r characters seemed to work for me
CSV.parse(File.read('filename').gsub(/\r/, '')) do |row|
...
end
Another option is to open the CSV file or the original spreadsheet in Excel and save it as "Windows Comma Separated" rather than "Comma Separated Values". This will output the file with line endings that FasterCSV is able to understand.
"""
When I open up the exported CSV in vim, all the text appears on one line, with ^M appearing between lines.
From the docs, it seems that you can provide open with a row separator; however I am unsure what it should be in this case.
"""
Read back a sentence ... ^M means keyboard Ctrl-M aka '\x0D' (M is the 13th letter of the ASCII alphabet; 0x0D == 13) aka ASCII CR (carriage return) aka '\r' ... IOW what Macs used to use as a line terminator before OS X.
It seems newer versions of the CSV parser, or of a component it uses, read DOS/Windows line endings without issues. The Ruby that ships with Mac OS X (I'm not sure of the version) was not cutting it; I installed Ruby 2.0.0 and it parsed the file just fine, without the special arguments.
I had a similar problem. I got an error:
"error_message"=>"Illegal quoting in line 1.", "error_class"=>"CSV::MalformedCSVError"
The problem was that the file had Windows line endings, which differ from Unix ones. What helped me was defining row_sep: "\r\n":
CSV.open(path, 'w', headers: :first_row, col_sep: ';', quote_char: '"', row_sep: "\r\n")