utf8 character in csv file - ruby

I am trying to create a database using Mongoid, but when Mongo creates the database I run into an issue with the UTF-8 encoding of the data.
The ExtractData class:
class ExtractData
  include Mongoid::Document
  include Mongoid::Timestamps

  def self.create_all_databases
    @@cbsa2msa = DbForCsv.import!('./share/private/csv/cbsa_to_msa.csv')
    @@zip2cbsa = DbForCsv.import!('./share/private/csv/zip_to_cbsa.csv')
  end

  def self.show_all_database
    ap @@cbsa2msa.all.to_a
    ap @@zip2cbsa.all.to_a
  end
end
The DbForCsv class works as below:
class DbForCsv
  include Mongoid::Document
  include Mongoid::Timestamps
  include Mongoid::Attributes::Dynamic

  def self.import!(file_path)
    columns = []
    instances = []
    CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8') do |row|
      if columns.empty?
        # We don't want attributes with whitespace
        columns = row.collect { |c| c.downcase.gsub(' ', '_') }
        next
      end
      instances << create!(build_attributes(row, columns))
    end
    instances
  end

  private

  def self.build_attributes(row, columns)
    attrs = {}
    columns.each_with_index do |column, index|
      attrs[column] = row[index]
    end
    ap attrs
    attrs
  end
end
I am using the encoding to make sure only UTF-8 chars are handled, but I still see:
{
       "ï»¿zip" => "71964",
          "cbsa" => "31680",
     "res_ratio" => "0.086511098",
     "bus_ratio" => "0.012048193",
     "oth_ratio" => "0.000000000",
     "tot_ratio" => "0.082435345"
}
when doing 'ap attrs' in the code. How do I make sure that 'ï»¿zip' becomes 'zip'?
I have also tried:
columns = row.collect { |c| c.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''}).downcase.gsub(' ', '_') }
but it's the same thing, and I also see:
ArgumentError - invalid byte sequence in UTF-8
Here is the csv file.
Thanks

If I take the word you read in from the csv file:
ï»¿zip
and paste it into a hex editor, it reveals that the word consists of the bytes:
  ï     »     ¿   z  i  p
  |     |     |   |  |  |
  V     V     V   V  V  V
C3 AF C2 BB C2 BF 7A 69 70
So what is that junk in front of "zip"?
The UTF-8 BOM is the string:
"\xEF\xBB\xBF"
If I force the encoding of the BOM string (which is UTF-8 by default in a ruby program) to iso-8859-1:
"\xEF\xBB\xBF".force_encoding("ISO-8859-1")
then look at an iso-8859-1 chart for those hex codes, I find:
EF => ï
BB => »
BF => ¿
Next, if I encode the BOM string to UTF-8:
"\xEF\xBB\xBF".force_encoding("ISO-8859-1").encode("UTF-8")
that asks ruby to replace the hex escapes in the string with the hex escapes for the same characters in the UTF-8 encoding, which are:
ï c3 af LATIN SMALL LETTER I WITH DIAERESIS
» c2 bb RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
¿ c2 bf INVERTED QUESTION MARK
giving me:
"\xC3\xAF\xC2\xBB\xC2\xBF"
Removing ruby's hex escape syntax gives me:
C3 AF C2 BB C2 BF
Compare that to what the hex editor revealed:
  ï     »     ¿   z  i  p
  |     |     |   |  |  |
  V     V     V   V  V  V
C3 AF C2 BB C2 BF 7A 69 70
Look familiar?
You are asking ruby to do the same thing as above when you write this:
CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8')
In the file, you have a UTF-8 BOM at the beginning of the file:
"\xEF\xBB\xBF"
But, you tell ruby that the file is encoded in ISO-8859-1 and that you want ruby to convert the file to UTF-8 strings inside your ruby program:
                              external encoding
                                       |
                                       V
CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8')
                                               ^
                                               |
                                       internal encoding
Therefore ruby goes through the same process as described above to produce a string inside your ruby program that looks like the following:
"\xC3\xAF\xC2\xBB\xC2\xBF"
which screws up your first row of CSV data. You said:
I am using the encoding to make sure only UTF8 char are handled
but that doesn't make any sense to me. If the file is UTF-8, then tell ruby that the external encoding is UTF-8:
                           external encoding
                                    |
                                    V
CSV.foreach(file_path, encoding: 'UTF-8:UTF-8')
                                        ^
                                        |
                                internal encoding
Ruby does not automatically skip the BOM when reading a file, so you will still get funny characters at the start of your first row. To fix that, you can use the external encoding 'BOM|UTF-8', which tells ruby to use a BOM if present to determine the external encoding, then skip over the BOM; or if no BOM is present, then use 'UTF-8' as the external encoding:
                             external encoding
                                      |
                                      V
CSV.foreach(file_path, encoding: 'BOM|UTF-8:UTF-8')
                                            ^
                                            |
                                    internal encoding
That encoding works fine with CSV.foreach(), and it will cause CSV to skip over the BOM after CSV determines the file's encoding.
Response to comment:
The file you posted isn't UTF-8 and there is no BOM. When you specify the external encoding as "BOM|UTF-8" and there is no BOM, you are telling CSV to fall back to an external encoding of UTF-8, and CSV errors out on this row:
"Doña Ana County"
The character ñ is stored as F1 in the file, which is the ISO-8859-1 hex code for ñ; F1 is not a valid byte in UTF-8 (in UTF-8, LATIN SMALL LETTER N WITH TILDE is actually encoded as C3 B1).
If you change the external encoding to "ISO-8859-1" and you specify the internal encoding as "UTF-8", then CSV will process the file without error, and CSV will convert the F1 read from the file to C3 B1 and hand your program UTF-8 encoded strings. The bottom line is: you have to know the encoding of a file to read it. If you are reading many files and they all have different encodings, then you have to know the encoding of each file before you can read it. If you are certain all your files are either ISO-8859-1 or UTF-8, then you can try reading the file with one encoding, and if CSV errors out, you can catch the encoding error and try the other encoding.
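That try-one-encoding-then-fall-back idea could be sketched as a small helper (hypothetical; the rescue list covers the ArgumentError seen here as well as the CSV::MalformedCSVError that newer csv gem versions raise for invalid bytes):

```ruby
require 'csv'

# Try UTF-8 first, honoring a BOM if one is present; if the bytes turn out
# not to be valid UTF-8, re-read the file as ISO-8859-1 instead. Either way
# the strings handed back to the program are UTF-8.
def read_csv_with_fallback(file_path)
  CSV.read(file_path, encoding: 'BOM|UTF-8:UTF-8')
rescue ArgumentError, CSV::MalformedCSVError
  CSV.read(file_path, encoding: 'ISO-8859-1:UTF-8')
end
```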

You have a BOM-marked UTF-8 file. Ruby is able to automatically skip BOM with a certain mode option to IO::new (and also IO::open and CSV::open). Unfortunately, AFAIK you can't make CSV::foreach pass a mode parameter, so we have to be a bit more verbose.
Replace your
CSV.foreach(file_path, encoding: 'iso-8859-1:UTF-8') do |row|
  # ...
end
with
CSV::open(file_path, 'r:bom|utf-8') do |csv|
  csv.each do |row|
    # ...
  end
end
EDIT: There is also a better way to read hashes from CSV with headers (no need for if...next, no need for build_attributes...), since Ruby's CSV library is quite powerful:
instances = CSV::open(file_path, 'r:bom|utf-8',
                      headers: true,
                      header_converters: lambda { |c| c.downcase.gsub(' ', '_') }) do |csv|
  csv.map do |row|
    create!(row.to_hash)
  end
end

Related

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in ruby and split the text line-by-line.
Here is my code:
def file_read(filename)
  File.open(filename, 'r').read
end

puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end

puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found suggestions online for similar errors and tried to use them in my own code, but it's not working. How can I remove this error?
Link to the file: File
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting it isn't desired or possible then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
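The failure and the fix can be reproduced on a small string (a sketch; "caf\xE9" stands in for an ISO-8859-1 byte sequence like the ones in the book text):

```ruby
# A Latin-1 byte (\xE9 = é) mislabeled as UTF-8 makes any regexp match fail:
broken = "caf\xE9".dup.force_encoding('UTF-8')
begin
  broken.scan(/\w/)
rescue ArgumentError => e
  puts e.message              # invalid byte sequence in UTF-8
end

# Relabel with the real encoding, transcode, and scanning works:
fixed = broken.force_encoding('ISO-8859-1').encode('UTF-8')
puts fixed                    # => café
puts fixed.scan(/\w/).join    # => caf  (\w is ASCII-only by default)
```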
It seems to work if you read the file directly from the page, maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)

Ruby: parse yaml from ANSI to UTF-8

Problem:
I have the yaml file test.yml that can be encoded in UTF-8 or ANSI:
:excel:
  "Test":
    "eins_Ä": :eins
    "zwei_ä": :zwei
When I load the file I need it to be encoded in UTF-8, so I tried to convert all of the strings:
require 'yaml'
file = YAML::load_file('C:/Users/S61256/Desktop/test.yml')

require 'iconv'
CONV = Iconv.new("UTF-8", "ASCII")

class Test
  def convert(hash)
    hash.each { |key, value|
      convert(value) if value.is_a? Hash
      CONV.iconv(value) if value.is_a? String
      CONV.iconv(key) if key.is_a? String
    }
  end
end

t = Test.new
converted = t.convert(file)
p file
p converted
But when I try to run this example script it prints:
in 'iconv': eins_Ä (Iconv::IllegalSequence)
Questions:
1. Why does the error show up and how can I solve it?
2. Is there another (more appropriate) way to get the file's content in UTF-8?
Note:
I need this code to be compatible to Ruby 1.8 as well as Ruby 2.2. For Ruby 2.2 I would replace all the Iconv stuff with String::encode, but that's another topic.
The easiest way to deal with wrong encoded files is to read it in its original encoding, convert to UTF-8 and then pass to receiver (YAML in this case):
▶ YAML.load File.read('/tmp/q.yml', encoding: 'ISO-8859-1').force_encoding 'UTF-8'
#⇒ {:excel=>{"Test"=>{"eins_Ä"=>:eins, "zwei_ä"=>:zwei}}}
For Ruby 1.8 you should probably use Iconv, but the whole process (read as-is, then encode, then YAML-load) remains the same.
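On Ruby >= 1.9 the read-then-transcode-then-load sequence might look like this (a sketch; the file is written inline so the example is self-contained, and String#encode stands in for the Iconv call):

```ruby
require 'yaml'
require 'tempfile'

# Write a small ANSI (ISO-8859-1) YAML file: \xC4 is "Ä" in Latin-1.
file = Tempfile.new('test_yml')
file.binmode
file.write("excel:\n  Test:\n    \"eins_\xC4\": eins\n")
file.close

# Read with the file's real encoding, transcode to UTF-8, then YAML-load.
utf8 = File.read(file.path, encoding: 'ISO-8859-1').encode('UTF-8')
data = YAML.load(utf8)
puts data["excel"]["Test"].keys.first  # => eins_Ä
```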

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

I am trying to convert a string from ISO-8859-1 encoding to UTF-8 but I can't seem to get it work. Here is an example of what I have done in irb.
irb(main):050:0> string = 'Norrlandsvägen'
=> "Norrlandsvägen"
irb(main):051:0> string.force_encoding('iso-8859-1')
=> "Norrlandsv\xC3\xA4gen"
irb(main):052:0> string = string.encode('utf-8')
=> "NorrlandsvÃ¤gen"
I am not sure why Norrlandsvägen in iso-8859-1 will be converted into NorrlandsvÃ¤gen in utf-8.
I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and all kinds of weird work-arounds I could think of but nothing seems to work. Can someone please help me/point me in the right direction?
Ruby newbie still pulling hair like crazy but feeling grateful for all the replies here... :)
Background of this question: I am writing a gem that downloads an xml file from some websites (which have iso-8859-1 encoding) and saves it to storage, and I would like to convert it to utf-8 first. But words like Norrlandsvägen keep messing me up. Really, any help would be greatly appreciated!
[UPDATE]: I realized running tests like this in the irb console might give me different behaviors so here is what I have in my actual code:
def convert_encoding(string, originalEncoding)
  puts "#{string.encoding}" # ASCII-8BIT
  string.encode(originalEncoding)
  puts "#{string.encoding}" # still ASCII-8BIT
  string.encode!('utf-8')
end
but the last line gives me the following error:
Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8
Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:
irb(main):001:0> string = 'ä'
=> "ä"
irb(main):002:0> string.force_encoding('iso-8859-1')
=> "\xC3\xA4"
I have also tried to assign a new variable to the result of string.encode(originalEncoding) but got an even weirder error:
newString = string.encode(originalEncoding)
puts "#{newString.encoding}" # can't even get to this line...
newString.encode!('utf-8')
and the error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1
I am still quite lost in all of this encoding mess but I am really grateful for all the replies and help everyone has given me! Thanks a ton! :)
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
EDIT For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding. Consequently, the following code causes your problem:
string = "Norrlandsvägen"
string.force_encoding('iso-8859-1')
puts string.encode('utf-8') # NorrlandsvÃ¤gen
Whereas the following code will actually correctly encode your contents:
string = "Norrlandsvägen".encode('iso-8859-1')
string.encode!('utf-8')
Here's an example running in irb:
irb(main):023:0> string = "Norrlandsvägen".encode('iso-8859-1')
=> "Norrlandsv\xE4gen"
irb(main):024:0> string.encoding
=> #<Encoding:ISO-8859-1>
irb(main):025:0> string.encode!('utf-8')
=> "Norrlandsvägen"
irb(main):026:0> string.encoding
=> #<Encoding:UTF-8>
The above answer was spot on. Specifically this point here:
There's a difference between force_encoding and encode. The former
sets the encoding for the string, whereas the latter actually
transcodes the contents of the string to the new encoding.
In my situation, I had a text file with iso-8859-1 encoding. By default, Ruby uses UTF-8 encoding, so if you were to try to read the file without specifying the encoding, then you would get an error:
results = File.read(file)
results.encoding
=> #<Encoding:UTF-8>
results.split("\r\n")
ArgumentError: invalid byte sequence in UTF-8
You get an invalid byte sequence error because the characters in different encodings are represented by different byte lengths. Consequently, you would need to specify the encoding to the File API. Think of it like force_encoding:
results = File.read(file, encoding: "iso-8859-1")
So everything is good right? No, not if you want to start parsing the iso-8859-1 string with UTF-8 character encodings:
results = File.read(file, encoding: "iso-8859-1")
results.each_line do |line|
  puts line.split('¬')
end
Encoding::CompatibilityError: incompatible character encodings: ISO-8859-1 and UTF-8
Why this error? Because '¬' is represented as UTF-8. You are using a UTF-8 character sequence against an ISO-8859-1 string. They are incompatible encodings. Consequently, after you read the File as a ISO-8859-1, then you can ask Ruby to encode that ISO-8859-1 into a UTF-8. And now you will be working with UTF-8 strings and thus no problems:
results = File.read(file, encoding: "iso-8859-1").encode('UTF-8')
results.encoding
results = results.split("\r\n")
results.each do |line|
  puts line.split('¬')
end
Ultimately, with some Ruby APIs, you do not need to use force_encoding('ISO-8859-1'). Instead, you just specify the expected encoding to the API. However, you must convert it back to UTF-8 if you plan to parse it with UTF-8 strings.

Ruby CSV + Spanish Characters: Encoding::UndefinedConversionError

I'm getting an Encoding::UndefinedConversionError ("U+2014 from UTF-8 to ISO-8859-1") when I try to use send_data with a CSV that has Spanish characters:
model:
def self.books_data(books)
  csv = CSV.generate(:col_sep => "|", quote_char: '"') do |csv|
    ...
  end
  csv
end
controller:
def export_data
  ...
  data = CsvGenerator.books_data(@books)
  send_data(data.encode("iso-8859-1"), filename: "books_data_#{date}.csv", type: 'text/csv; charset=iso-8859-1; header=present') # <-- error occurs here
end
How would I fix this?
=== UPDATE ===
I think I semi-fixed it by replacing .encode with .force_encoding. However, I now have a lot of characters that don't look right:
Ex: The file contains:
My Diary from Here to There / Mi diario de aqui hasta allÃ¡
when it should look like
My Diary from Here to There / Mi diario de aqui hasta allá
String#force_encoding should never be used as it just "tags" string with different encoding, while #encode does actual conversion.
The reason you're getting this error is that somewhere in your data you have a \u2014 character: "—". As the String#encode documentation states:
raise Encoding::UndefinedConversionError for characters that are undefined in the destination encoding [...]
And if you check the ISO-8859-1 map (http://en.wikipedia.org/wiki/ISO/IEC_8859-1), there is no "—" character in 8859-1. So to solve this, you need to remove those "invalid" characters from your data.
Besides that, unless there are specific reasons not to, you should avoid such conversions and let the CSV be generated in utf-8 encoding.
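If you do need the ISO-8859-1 output, String#encode can replace the offending characters for you via its undef: option rather than raising (a sketch; the "-" replacement is an arbitrary choice):

```ruby
data = "Mi diario \u2014 allá"  # contains U+2014 (em dash), which has no ISO-8859-1 mapping

# Replace any character undefined in the target encoding instead of raising:
latin1 = data.encode('ISO-8859-1', undef: :replace, replace: '-')
puts latin1.encode('UTF-8')     # => Mi diario - allá
```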
http://railscasts.com/episodes/362-exporting-csv-and-excel

Ruby read CSV file as UTF-8 and/or convert ASCII-8Bit encoding to UTF-8

I'm using ruby 1.9.2
I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.
When I read the lines from the CSV file,
file_contents = CSV.read("csvfile.csv", col_sep: "$")
The elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT properly saved into my MySQL database.
Yehuda Katz says that ASCII-8BIT is really "binary" data meaning that CSV has no idea how to read the appropriate encoding.
So, if I try to make CSV force the encoding like this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")
I get the following error
ArgumentError: invalid byte sequence in UTF-8:
If I go back to my original ASCII-8BIT encoded Strings and examine the String that my CSV read as ASCII-8BIT, it looks like this "Non sp\xE9cifi\xE9" instead of "Non spécifié".
I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this
"Non sp\xE9cifi\xE9".encode("UTF-8")
because I get this error:
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8,
which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".
Questions:
Can I get CSV to read my file in the appropriate encoding? If so, how?
How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?
deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:
require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:
utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
With newer Rubies, you can do things like this:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
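For example, with the string from the question (a sketch; the literal stands in for bytes that arrived labeled ASCII-8BIT):

```ruby
# Bytes read as binary (ASCII-8BIT) that are really Latin-1 text:
latin1_string = "Non sp\xE9cifi\xE9".dup.force_encoding('ASCII-8BIT')

# Relabel with the true encoding, then transcode to UTF-8:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
puts utf8_string  # => Non spécifié
```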
With ruby >= 1.9 you can use
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")
ISO8859-1:utf-8 means: the csv file is ISO8859-1-encoded, but the content will be converted to utf-8 as it is read.
If you prefer a more verbose code, you can use:
file_contents = CSV.read("csvfile.csv", col_sep: "$",
                         external_encoding: "ISO8859-1",
                         internal_encoding: "utf-8")
I had been dealing with this issue for a while and none of the other solutions worked for me.
What did the trick was to store the conflictive string in a binary file, then read the file back normally and use that string to feed the CSV module:
tempfile = Tempfile.new("conflictive_string")
tempfile.binmode
tempfile.write(conflictive_string)
tempfile.close
cleaned_string = File.read(tempfile.path)
File.delete(tempfile.path)
csv = CSV.new(cleaned_string)
