How can I read CSV with strange quoting in Ruby?

I have a CSV file with a line like:
col1,col "two",col3
so I get an "Illegal quoting" error, which I fix by setting :quote_char => "\x00"; parsed that way, the line becomes:
["col1", "col \"two\"", "col3"]
but later in that file there is a line like:
col1,col2,"col,3"
which with that setting parses as:
["col1", "col2", "\"col", "3\""]
So I read the file line by line and call parse_csv wrapped in a block: I set :quote_char => "\"", rescue CSV::MalformedCSVError exceptions, and for those particular lines set :quote_char => "\x00" and retry (sketched below).
All works perfectly until we get a line that mixes both styles:
col1,col "two","col,3"
In this case it rescues from the exception, sets :quote_char => "\x00", and the result is:
["col1", "col \"two\"", "\"col", "3\""]
Apple Numbers is able to open that file absolutely correctly.
Is there any setting for parse_csv that handles this without preprocessing the string in some way?
UPD: I show the CSV lines as they appear in the file, and the results (arrays) as printed by p. There are no actual \" sequences in my strings.
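A sketch of that retry approach (the file name is a placeholder):
require 'csv'

rows = File.foreach('file.csv').map do |line|
  begin
    line.parse_csv(:quote_char => "\"")
  rescue CSV::MalformedCSVError
    line.parse_csv(:quote_char => "\x00")
  end
end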

This is an invalid CSV file. If you have access to the source, you could (ask to) generate the data as follows:
col1,"col ""two""","col,3"
If not, the only option is to parse the data yourself:
Here is a sketch of that in Ruby (quote characters are kept in each field; stripping or un-escaping them is left to post-processing):
def parse_line(line)
  fields = []
  field = +''
  inside_quotes = false
  line.chomp.each_char do |char|
    if char == '"'
      inside_quotes = !inside_quotes  # toggle on every double quote
      field << char
    elsif char == ',' && !inside_quotes
      fields << field                 # separator found - finish this field
      field = +''
    else
      field << char
    end
  end
  fields << field
end
This will also take care of escaped quotes like in col1,"col ""two""","col,3".
If the file contains multiline fields, some more work has to be done.
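For example, on the mixed line from the question:
p parse_line(%(col1,col "two","col,3"))
#=> ["col1", "col \"two\"", "\"col,3\""]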

CSV is less a standard and more a name that everyone thinks describes their own quirky format correctly, and this is despite there being an RFC standard for CSV (RFC 4180), which is just another thing nobody pays attention to.
As such, a lot of programs that read CSV are very forgiving. Ruby's core CSV library is pretty good, but not as adaptable as some others. That's because you've got Ruby there to get you out of a jam; in Numbers you don't.
Try rewriting \" to "", which is conventional CSV formatting as defined in the spec mentioned above ('file.csv' standing in for your path):
CSV.parse(File.read('file.csv').gsub(/\\"/, '""'))

Related

Parse CSV file with headers when the headers are part way down the page

I have a CSV file that, as a spreadsheet, looks like this (screenshot omitted; the answer below reconstructs the data):
I want to parse the spreadsheet with the headers at row 19. Those headers won't always start at row 19, so my question is: is there a simple way to parse this spreadsheet and specify which row holds the headers, say by using the "Date" string to identify the header row?
Right now, I'm doing this:
CSV.foreach(params['logbook'].tempfile, headers: true) do |row|
  Flight.create(row.to_hash)
end
but obviously that won't work because it doesn't get the right headers.
I feel like there should be a simple solution to this, since it's pretty common to have CSV files in this format.
Let's first create the CSV file that would be produced from the spreadsheet.
csv = <<-_
N211E,C172,2004,Cessna,172R,airplane,airplane
C-GPGT,C172,1976,Cessna,172M,airplane,airplane
N17AV,P28A,1983,Piper,PA-28-181,airplane,airplane
N4508X,P28A,1975,Piper,PA-28-181,airplane,airplane
,,,,,,
Flights Table,,,,,,
Date,AircraftID,From,To,Route,TimeOut,TimeIn
2017-07-27,N17AV,KHPN,KHPN,KHPN KHPN,17:26,18:08
2017-07-27,N17AV,KHSE,KFFA,,16:29,17:25
2017-07-27,N17AV,W41,KHPN,,21:45,23:53
_
FName = 'test.csv'
File.write(FName, csv)
#=> 395
We only want the part of the string that begins "Date,". The easiest option is probably to first extract the relevant text. If the file is not humongous, we can slurp it into a string and then remove the unwanted bit.
str = File.read(FName).gsub(/\A.+?(?=^Date,)/m, '')
#=> "Date,AircraftID,From,To,Route,TimeOut,TimeIn\n2017-07-27,N17AV,
# KHPN,KHPN,KHPN KHPN,17:26,18:08\n2017-07-27,N17AV,KHSE,KFFA,,16:29,
# 17:25\n2017-07-27,N17AV,W41,KHPN,,21:45,23:53\n"
The regular expression that is gsub's first argument could be written in free-spacing mode, which makes it self-documenting:
/
\A          # match the beginning of the string
.+?         # match any number of characters, lazily
(?=^Date,)  # match "Date," at the beginning of a line in a positive lookahead
/mx         # multi-line and free-spacing regex definition modes
Now that we have the part of the file we want in the string str, we can use CSV::parse to create the CSV::Table object:
csv_tbl = CSV.parse(str, headers: true)
#=> #<CSV::Table mode:col_or_row row_count:4>
The option :headers => true is documented in CSV::new.
Here are a couple of examples of how csv_tbl can be used.
csv_tbl.each { |row| p row }
#=> #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"KHPN"\
# "To":"KHPN" "Route":"KHPN KHPN" "TimeOut":"17:26" "TimeIn":"18:08">
# #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"KHSE"\
# "To":"KFFA" "Route":nil "TimeOut":"16:29" "TimeIn":"17:25">
# #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"W41"\
# "To":"KHPN" "Route":nil "TimeOut":"21:45" "TimeIn":"23:53">
(I've used the character '\' to signify that the string continues on the following line, so that readers would not have to scroll horizontally to read the lines.)
csv_tbl.each { |row| p row["From"] }
# "KHPN"
# "KHSE"
# "W41"
Readers who want to know more about how Ruby's CSV class is used may wish to read Darko Gjorgjievski's piece, "A Guide to the Ruby CSV Library, Part 1 and Part 2".
You can use the smarter_csv gem for this. Parse the file once to determine how many rows you need to skip to get to the header row you want, and then use the skip_lines option:
header_offset = <code to determine number of lines above the header>
SmarterCSV.process(params['logbook'].tempfile, skip_lines: header_offset)
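One hypothetical way to compute that offset, if the header row is the one that starts with "Date," as in the question:
path = params['logbook'].tempfile.path
header_offset = File.foreach(path).find_index { |line| line.start_with?('Date,') }
SmarterCSV.process(path, skip_lines: header_offset)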
From this format, I think the easiest way is to detect the empty line that comes before the header line. That would also keep working if the header text changes. In CSV terms, that means a whole line whose cells are all empty.
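A minimal sketch of that idea against the sample file created above (it assumes, as in the sample, that a single title row sits between the blank row and the header row):
require 'csv'

lines = File.readlines('test.csv')
# The all-empty row is ",,,,,,"; the header follows the "Flights Table"
# title row, so skip two lines past the blank one.
blank = lines.index { |l| l.strip.delete(',').empty? }
table = CSV.parse(lines.drop(blank + 2).join, headers: true)
table.headers
#=> ["Date", "AircraftID", "From", "To", "Route", "TimeOut", "TimeIn"]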

Escaping Strings For Ruby SQLite Insert

I'm creating a Ruby script to import a tab-delimited text file of about 150k lines into SQLite. Here it is so far:
require 'sqlite3'
file = File.new("/Users/michael/catalog.txt")
string = []
# Escape single quotes, remove newline, split on tabs,
# wrap each item in quotes, and join with commas
def prepare_for_insert(s)
  s.gsub(/'/, "\\\\'").chomp.split(/\t/).map { |str| "'#{str}'" }.join(", ")
end
file.each_line do |line|
  string << prepare_for_insert(line)
end
database = SQLite3::Database.new("/Users/michael/catalog.db")
# Insert each string into the database
string.each do |str|
  database.execute("INSERT INTO CATALOG VALUES (#{str})")
end
The script errors out on the first line containing a single quote in spite of the gsub to escape single quotes in my prepare_for_insert method:
/Users/michael/.rvm/gems/ruby-1.9.3-p0/gems/sqlite3-1.3.5/lib/sqlite3/database.rb:91:
in `initialize': near "s": syntax error (SQLite3::SQLException)
It's erroring out on line 15. If I inspect that line with puts string[14], I can see where it's showing the error near "s". It looks like this: 'Touch the Top of the World: A Blind Man\'s Journey to Climb Farther Than the Eye Can See'
Looks like the single quote is escaped, so why am I still getting the error?
Don't do it like that at all: string interpolation and SQL tend to be a bad combination. Use a prepared statement instead and let the driver deal with quoting and escaping:
# Ditch the gsub in prepare_for_insert and...
db = SQLite3::Database.new('/Users/michael/catalog.db')
ins = db.prepare('insert into catalog (column_name) values (?)')
string.each { |s| ins.execute(s) }
You should replace column_name with the real column name, of course; you don't have to specify the column names in an INSERT, but you should always do it anyway. If you need to insert more columns, then add more placeholders and arguments to ins.execute.
Using prepare and execute should be faster, safer, easier, and it won't make you feel like you're writing PHP in 1999.
Also, you should use the standard CSV parser to parse your tab-separated files; XSV formats aren't much fun to deal with (they're downright evil, in fact) and you have better things to do with your time than deal with their nonsense and edge cases and whatnot.
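Putting both suggestions together, a minimal sketch (the column name is a placeholder; the real table presumably has several):
require 'csv'
require 'sqlite3'

db  = SQLite3::Database.new('/Users/michael/catalog.db')
ins = db.prepare('INSERT INTO catalog (title) VALUES (?)')  # 'title' is a placeholder column

CSV.foreach('/Users/michael/catalog.txt', col_sep: "\t") do |row|
  ins.execute(row[0])  # one placeholder and one argument per real column
end
ins.close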

Ruby, Rhomobile, jQuery Mobile and Single Quote

In Rhomobile, which runs on Ruby, I have code that parses a file and saves it to the SQLite DB:
Questions.delete_all()
file_name = File.join(Rho::RhoApplication::get_model_path('app', 'Settings'), 'questions.txt')
file = File.new(file_name)
file.each_line("\n") do |row|
  col = row.split("|")
  question = Questions.create(
    { "id" => col[0], "question" => col[1], "answered" => '0', "show" => '1', "tutorial" => col[4] }
  )
  break if file.lineno > 1500
end
file.close
When there is a single quote (') in the text, for example the expression
It's funny
Then after parsing, saving and populating I get
It�s funny
Any idea how to solve this, and where it comes from: Ruby, SQLite, or something else?
I would check to make sure that your parsing isn't doing something funny. Rhodes handles all of the necessary escaping in its ORM; I've never had any issues with quotes in the db.
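One guess worth checking (an assumption on my part, not anything Rhodes-specific): the � replacement character often means the source file is Windows-1252 rather than UTF-8, in which case opening it with an explicit encoding fixes the curly apostrophe:
# Assumes questions.txt is Windows-1252; transcode to UTF-8 on read
file = File.new(file_name, 'r:Windows-1252:UTF-8')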

When we import CSV data, how do we eliminate "invalid byte sequence in UTF-8"?

We allow users to import data via CSV (using Ruby 1.9.2, hence it's FasterCSV).
Being user data, of course, it might not be properly sanitized.
When we try to display the data in an /index method we sometimes get the error "invalid byte sequence in UTF-8", pointing to our ERB where we display one of the fields, widget.name.
When we do the import we'd like to FORCE the incoming data to be valid... is there a Ruby operator that will map a string to a valid UTF-8 string, e.g. something like
goodstring = badstring.no_more_invalid_bytes
One example of 'bad' data is a char that looks like a hyphen but is not a regular ASCII hyphen. We'd prefer to map the non-UTF-8 chars to a reasonable ASCII equivalent (umlaut-u going to u, for example), BUT we're okay with simply stripping the character too.
Since this is when importing lots of data, it needs to be a fast built-in operator, hopefully...
Note: here is an example of the data. The file comes from Windows and is 8-bit ASCII. When we import it and display widget.name.inspect (instead of widget.name) in our ERB, we get:
"Chains \x96 Accessories"
So one example of the data is a "hyphen" that's actually the 8-bit code 0x96.
--- When we changed our CSV parse to assign fldval = d.encode('UTF-8'),
it throws this error:
Encoding::UndefinedConversionError in StoresController#importfinderitems
"\x96" from ASCII-8BIT to UTF-8
What we're looking for is a simple way to force the data to be valid UTF-8 regardless of origin type, even if we simply strip non-ASCII characters.
While not as 'nice' as forcing the encoding, this works, at a slight expense to our import time:
d.to_s.strip.gsub(/\P{ASCII}/, '')
Thank you, Mladen!
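Applied to the sample string above, that looks like this (the string has to be tagged as binary first, since gsub cannot scan a string carrying invalid UTF-8):
d = "Chains \x96 Accessories".force_encoding('ASCII-8BIT')  # as it comes out of the file
d.to_s.strip.gsub(/\P{ASCII}/, '')
#=> "Chains  Accessories"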
Ruby 1.9's CSV has a new parser that works with m17n. The parser works with the Encoding of the IO object or string it reads from. The methods ::foreach, ::open, ::read, and ::readlines take an optional :encoding option in which you can specify the Encoding.
For example:
CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')
Would convert all strings to UTF-8.
Also, you can use the more standard encoding name 'ISO-8859-1':
CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})
CSV.parse(File.read('/path/to/csv').scrub)
I answered a similar question that deals with reading external files in 1.9.2 with non-UTF-8 encodings. I think that answer will help you a lot: Character Encoding issue in Rails v3/Ruby 1.9.2
Note that you need to know the source encoding in order to convert it reliably. There are libraries like the one I linked to in my other answer that can help you determine this.
Also, if you aren't loading the data from a file, you can convert the encoding of a string in 1.9.2 quite easily:
'string'.encode('UTF-8')
However, it's rare that you're building a string in another encoding, and it's best to convert it at the time it's read into your environment if possible.
Ruby 1.9 can change string encoding with invalid detection and replacement:
str = str.encode('UTF-8', :invalid => :replace)
For unusual strings, such as strings loaded from a file of unknown encoding, it's wise to use #encode instead of a regex, #gsub, or #delete, because those all need the string to be parsed; but if the string is broken, it can't be parsed, so those methods fail.
If you get a message like this:
error ** from ASCII-8BIT to UTF-8
Then you're probably trying to convert a binary string that's already in UTF-8, and you can force UTF-8:
str.force_encoding('UTF-8')
If you know the original string is not in binary UTF-8, or if the output string has illegal characters, then read up on Ruby encoding transliterations.
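For example, if the source turns out to be Windows-1252 (where byte 0x96 is an en dash), a proper transcode preserves the character instead of stripping it:
"Chains \x96 Accessories".force_encoding('Windows-1252').encode('UTF-8')
#=> "Chains – Accessories"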
If you are using Rails, you can try to fix it with the following:
'Your string with strange stuff ##~'.mb_chars.tidy_bytes
It removes the invalid UTF-8 chars and replaces them with valid ones.
More info: https://apidock.com/rails/String/mb_chars
Upload the CSV file to a Google Docs spreadsheet and re-download it as a CSV file. Import, and voilà! (Worked in my case.)
Presumably Google converts it to the wanted format.
Source: Excel to CSV with UTF-8 Encoding
As mentioned by someone else, scrub works well to clean this up in Ruby 2.1+. If you have a large file you may not want to read the whole thing into memory, so you can use scrub like this:
data = IO::read(file_path).scrub("")
CSV.parse(data, :col_sep => ',', :headers => true) do |row|
  puts row
end
I am using a Mac and I was having the same error:
rescue in parse:Invalid byte sequence in UTF-8 in line 1 (CSV::MalformedCSVError)
I added :encoding => 'ISO-8859-1', which resolved my error, and the CSV file could be read:
results = CSV.read("query_result.csv",{:headers => true, :encoding => 'ISO-8859-1'})
:headers => true : If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of ::parse_line with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers. This setting causes #shift to return rows as CSV::Row objects instead of Arrays and #read to return CSV::Table objects instead of an Array of Arrays.
irb(main):024:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true)
=> <#CSV io_type:StringIO encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>
irb(main):025:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true).to_a
=> [#<CSV::Row "a":"1" "b":"2" "c":"3">]
irb(main):026:0> rows.first['a']
=> "1"
In the above example you can clearly see that this also enables us to use the data as hashes.
The only thing you need to be careful about while using headers: true is duplicate headers: once rows are treated as hashes, keys must be unique, so duplicated header names will collide.
Just do this:
anyobject.to_csv(:encoding => 'utf-8')

Ruby string encoding problem

I've looked at the other Ruby/encoding-related posts but haven't been able to figure out why the following is not working. Likely just because I'm dense, but here's the situation.
Using Ruby 1.9 on Windows. I have a set of CSV files that need some data appended to the end of each line. Whenever I run my script, the appended characters are gibberish. The input text appears to be IBM437 encoded, whereas the string I'm appending starts as US-ASCII. Nothing I've tried with respect to forcing the encoding of the input strings or the append string seems to change the resulting output. I'm stumped. The current encoding version is simply the last one I tried.
def append_salesperson(txt, salesperson)
  if txt.length > 2
    return txt.chomp.force_encoding('US-ASCII') + %(, "", "", "#{salesperson}")
  end
end

salespeople = Hash["fname", "Record Manager"]
outfile = File.open("ActData.csv", "w:US-ASCII")
salespeople.each do |filename, recordManager|
  infile = File.open("#{filename}.txt")
  infile.each do |line|
    outfile.puts append_salesperson(line, recordManager)
  end
  infile.close
end
outfile.close
One small note related to your question: in your CSV data you have %(, "", "", "#{salesperson}"), with a space character before the double quotes. That space can cause #{salesperson} to be interpreted as multiple fields if the text contains a comma. To fix this there can't be whitespace between the comma and the double quotes. Example: "this is a field","Last, First","and so on". This is one little gotcha that I ran into when creating reports meant to be viewed in Excel.
Common Format and MIME Type for Comma-Separated Values (CSV) Files (RFC 4180) describes the grammar of a CSV file, for reference.
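A quick illustration of that gotcha with Ruby's own parser, which rejects the stray quote outright where Excel would instead mis-split the field:
require 'csv'

begin
  CSV.parse_line(%q{field1, "Last, First"})  # space before the opening quote
rescue CSV::MalformedCSVError => e
  puts e.message
end

p CSV.parse_line(%q{field1,"Last, First"})
#=> ["field1", "Last, First"]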
Maybe txt.chomp.force_encoding('US-ASCII') + %(, "", "", "#{salesperson.force_encoding('something')}") ?
It sounds like the CSV data is coming in as UTF-16... hence the puts shows the printable character (the first byte) plus a space (the second byte).
Have you tried encoding your appended data with .force_encoding(Encoding::UTF_16LE) or .force_encoding(Encoding::UTF_16BE)?
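Either way, once the real input encoding is known, the cleanest fix is to transcode at the IO boundary rather than force_encoding afterwards. A sketch assuming the IBM437 guess from the question:
# Transcode IBM437 input to UTF-8 on read, and write UTF-8 out
infile  = File.open("#{filename}.txt", 'r:IBM437:UTF-8')
outfile = File.open('ActData.csv', 'w:UTF-8')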
