How to avoid getting UndefinedConversionError when decoding a file? - ruby

I have a file, somefile, which I want to encode using Base64:
File.open('data/somefile.edf').read.encoding
=> #<Encoding:UTF-8>
base64_string = Base64.encode64(open("data/somefile.edf").to_a.join)
And then I want to decode that file
file = open('new_edf.edf', 'w') do |file|
file << Base64.decode64(base64_string)
end
But I get an error:
Encoding::UndefinedConversionError: "\xE1" from ASCII-8BIT to UTF-8
from (pry):22:in `write'

I believe the problem is that by default the file is opened for writing in text mode. To fix this, open the file using binary mode:
File.open('new_edf.edf', 'wb') { ... }
You can also check out this other question: Ruby 1.9 Base64 encoding write to file error
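A minimal round-trip sketch of the binary-mode approach described above. The helper names (base64_copy, demo_base64_copy) and the sample bytes are hypothetical stand-ins for the real .edf file, not anything from the original question:

```ruby
require 'base64'
require 'tmpdir'

# Encode a file's raw bytes as Base64, then decode them into a new file.
# Both files are handled in binary mode, so Ruby never tries to transcode.
def base64_copy(source_path, target_path)
  encoded = Base64.encode64(File.binread(source_path))
  File.open(target_path, 'wb') { |f| f << Base64.decode64(encoded) }
  encoded
end

# Hypothetical sample data containing a byte (0xE1) that is not valid UTF-8.
SAMPLE_BYTES = "EDF\xE1\x00\xFF".b

# Writes the sample to a temp dir, copies it via Base64, returns the copy's bytes.
def demo_base64_copy
  Dir.mktmpdir do |dir|
    source = File.join(dir, 'somefile.edf')
    target = File.join(dir, 'new_edf.edf')
    File.binwrite(source, SAMPLE_BYTES)
    base64_copy(source, target)
    File.binread(target)
  end
end

puts demo_base64_copy == SAMPLE_BYTES
```

File.binread and the 'wb' mode both treat the data as ASCII-8BIT, so no encoding conversion is attempted on either side of the round trip.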

Related

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in ruby and split the text line-by-line.
Here is my code:
def file_read(filename)
File.open(filename, 'r').read
end
puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
File.open(filename, 'r').read
end
def line_cutter(file)
file.scan(/\w/)
end
puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found this approach online on an untrusted website and tried to adapt it to my own code, but it's not working. How can I get rid of this error?
Link to the file: File
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting it isn't desired or possible then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
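To see the difference without the original file, here is a small sketch that writes a hypothetical ISO-8859-1 sample to a temporary file and reads it back both ways. The helper names and the sample text are assumptions made for illustration:

```ruby
require 'tempfile'

# Hypothetical stand-in for the real text: "café\n" stored as ISO-8859-1,
# where "é" is the single byte 0xE9.
LATIN1_SAMPLE = "caf\xE9\n".b

# Yields the path of a temporary file holding the sample bytes.
def with_latin1_file
  Tempfile.create('latin1') do |f|
    f.binmode
    f.write(LATIN1_SAMPLE)
    f.flush
    yield f.path
  end
end

# Reads the file as ISO-8859-1 and transcodes it to UTF-8 on the way in.
def read_latin1_as_utf8(path)
  File.read(path, encoding: 'ISO-8859-1:UTF-8')
end

with_latin1_file do |path|
  # Labeling the bytes as UTF-8 leaves an invalid string behind.
  mislabeled = File.read(path, encoding: 'UTF-8')
  puts mislabeled.valid_encoding?

  # With the real source encoding declared, the string is valid UTF-8
  # and regular-expression methods such as scan can process it.
  text = read_latin1_as_utf8(path)
  puts text.encoding
  puts text.scan(/\w/).length
end
```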
It seems to work if you read the file directly from the page; maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)

Ruby: parse yaml from ANSI to UTF-8

Problem:
I have the yaml file test.yml that can be encoded in UTF-8 or ANSI:
:excel:
"Test":
"eins_Ä": :eins
"zwei_ä": :zwei
When I load the file I need it to be encoded in UTF-8 therefore tried to convert all of the Strings:
require 'yaml'
file = YAML::load_file('C:/Users/S61256/Desktop/test.yml')
require 'iconv'
CONV = Iconv.new("UTF-8", "ASCII")
class Test
def convert(hash)
hash.each{ |key, value|
convert(value) if value.is_a? Hash
CONV.iconv(value) if value.is_a? String
CONV.iconv(key) if key.is_a? String
}
end
end
t = Test.new
converted = t.convert(file)
p file
p converted
But when I try to run this example script it prints:
in 'iconv': eins_- (Iconv:IllegalSequence)
Questions:
1. Why does the error show up and how can I solve it?
2. Is there another (more appropriate) way to get the file's content in UTF-8?
Note:
I need this code to be compatible with Ruby 1.8 as well as Ruby 2.2. For Ruby 2.2 I would replace all the Iconv stuff with String#encode, but that's another topic.
The easiest way to deal with wrongly encoded files is to read them in their original encoding, convert to UTF-8, and then pass the result to the receiver (YAML in this case):
▶ YAML.load File.read('/tmp/q.yml', encoding: 'ISO-8859-1').encode 'UTF-8'
#⇒ {:excel=>{"Test"=>{"eins_Ä"=>:eins, "zwei_ä"=>:zwei}}}
For Ruby 1.8 you should probably use Iconv, but the whole process (read as is, then encode, then YAML-load) remains the same.
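The read, convert, parse sequence can be sketched as follows for Ruby 1.9 and later. The sample document and helper names are hypothetical, and the sample uses plain string keys rather than the symbols from the question, since YAML.safe_load rejects symbols unless they are explicitly permitted:

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical YAML document stored as ISO-8859-1: "Ä" is the single byte 0xC4.
LATIN1_YAML = "label: \"eins_\xC4\"\n".b

# Read the file in its original encoding, transcode to UTF-8, then parse.
def load_latin1_yaml(path)
  YAML.safe_load(File.read(path, encoding: 'ISO-8859-1:UTF-8'))
end

# Writes the sample to a temporary file and loads it back.
def demo_load_latin1_yaml
  Tempfile.create('latin1_yaml') do |f|
    f.binmode
    f.write(LATIN1_YAML)
    f.flush
    load_latin1_yaml(f.path)
  end
end

p demo_load_latin1_yaml
```

The 'ISO-8859-1:UTF-8' form transcodes the bytes; merely relabeling them with force_encoding would leave the byte 0xC4 in place, which is not valid UTF-8.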

SmarterCSV and file encoding issues in Ruby

I'm working with a file that appears to have UTF-16LE encoding. If I run
File.read(file, :encoding => 'utf-16le')
the first line of the file is:
"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n
If I read the file using something like
csv_text = File.read(file, :encoding => 'utf-16le')
I get an error stating
ASCII incompatible encoding needs binmode (ArgumentError)
If I switch the encoding in the above to
csv_text = File.read(file, :encoding => 'utf-8')
I make it to the SmarterCSV section of the code, but get an error that states
`=~': invalid byte sequence in UTF-8 (ArgumentError)
The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:
require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'
Dir.glob("#{path}*.CSV").each do |file|
csv_text = File.read(file, :encoding => 'utf-16le')
File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
puts 'made it here'
SmarterCSV.process('/tmp/tmp_file', {
:col_sep => "\t",
:force_simple_split => true,
:headers_in_file => false,
:user_provided_headers => headers
}).each do |row|
converted_row = {}
converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
converted_row[:timestamp] = row[:timestamp]
converted_row[:sender] = row[:sender][2..-2]
converted_row[:phone_number] = row[:phone_number][2..-2]
converted_row[:message] = row[:message][1..-2]
converted_row[:room] = file.gsub(path, '')
end
end
Update - 05/13/15
Ultimately, I decided to encode the file string as UTF-8 rather than diving deeper into the SmarterCSV code. The first problem in the SmarterCSV code is that it does not allow a user to specify binary mode when reading in a file, but after adjusting the source to handle that, a myriad of other encoding-related issues popped up, many of which were related to the handling of various parameters on files that were not UTF-8 encoded. It may have been the easy way out, but encoding everything as UTF-8 before feeding it into SmarterCSV solved my issue.
Add binmode to the File.read call.
File.read(file, :encoding => 'utf-16le', mode: "rb")
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read
Now pass the correct encoding to SmarterCSV
SmarterCSV.process('/tmp/tmp_file', {
:file_encoding => "utf-16le", ...
Update
It was found that SmarterCSV does not support binary mode. After the OP tried modifying the code without success, the simple solution was to convert the input to UTF-8, which SmarterCSV supports.
Unfortunately, you're using a 'flat-file' style of storage and character encoding is going to be an issue on both ends (reading or writing).
I would suggest using something along the lines of str = str.force_encoding("UTF-8") and see if you can get that to work.
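A sketch of the "convert to UTF-8 first" route that the question's author settled on. It reads the raw bytes, declares them as UTF-16LE, transcodes, and strips a leading byte order mark. The helper names and the sample line are hypothetical; the resulting UTF-8 copy would then be handed to SmarterCSV.process as in the question:

```ruby
require 'tmpdir'

# Convert raw UTF-16LE bytes to a UTF-8 string and drop a leading BOM (U+FEFF).
def utf16le_to_utf8(raw_bytes)
  text = raw_bytes.dup.force_encoding('UTF-16LE').encode('UTF-8')
  text.sub(/\A\uFEFF/, '')
end

# Write a UTF-8 copy of a UTF-16LE file and return the copy's path.
def write_utf8_copy(source_path, target_path)
  File.binwrite(target_path, utf16le_to_utf8(File.binread(source_path)))
  target_path
end

# Hypothetical sample line, encoded as UTF-16LE with a BOM in front.
SAMPLE_UTF16 = "\uFEFF=\"25/09/2013\"\t18:39:17\r\n".encode('UTF-16LE')

# Round-trips the sample through temporary files and returns the UTF-8 text.
def demo_utf8_copy
  Dir.mktmpdir do |dir|
    source = File.join(dir, 'input.CSV')
    File.binwrite(source, SAMPLE_UTF16)
    copy = write_utf8_copy(source, File.join(dir, 'tmp_file'))
    File.read(copy, encoding: 'UTF-8')
  end
end

p demo_utf8_copy
```

Reading with File.binread sidesteps the "ASCII incompatible encoding needs binmode" error, because no encoding is applied until force_encoding declares it explicitly.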

Ruby CSV UTF8 encoding error while reading

This is what I was doing:
csv = CSV.open(file_name, "r")
I used this for testing:
line = csv.shift
while not line.nil?
puts line
line = csv.shift
end
And I ran into this:
ArgumentError: invalid byte sequence in UTF-8
I read the answer here and this is what I tried
csv = CSV.open(file_name, "r", encoding: "windows-1251:utf-8")
I ran into the following error:
Encoding::UndefinedConversionError: "\x98" to UTF-8 in conversion from Windows-1251 to UTF-8
Then I came across a Ruby gem - charlock_holmes. I figured I'd try using it to find the source encoding.
CharlockHolmes::EncodingDetector.detect(File.read(file_name))
=> {:type=>:text, :encoding=>"windows-1252", :confidence=>37, :language=>"fr"}
So I did this:
csv = CSV.open(file_name, "r", encoding: "windows-1252:utf-8")
And still got this:
Encoding::UndefinedConversionError: "\x8F" to UTF-8 in conversion from Windows-1252 to UTF-8
It looks like the problem is detecting the correct encoding of your file. CharlockHolmes gives you a useful hint with :confidence=>37, which means the detected encoding may well not be the right one.
Based on the error messages and on test_transcode.rb from https://github.com/MacRuby/MacRuby/blob/master/test-mri/test/ruby/test_transcode.rb, I found an encoding that gets past both of your error messages. With the help of String#encode it's easy to test:
"\x8F\x98".encode("UTF-8","cp1256") # => "ڈک"
Your issue looks strictly related to the file, not to Ruby.
If we are not sure which encoding to use and can accept losing some characters, we can use the :invalid and :undef options of String#encode, in this case:
"\x8F\x98".encode("UTF-8", "CP1250",:invalid => :replace, :undef => :replace, :replace => "?") # => "Ź?"
Another way is to use Iconv's //IGNORE option on the target encoding:
Iconv.iconv("UTF-8//IGNORE","CP1250", "\x8F\x98")
As the source encoding, the suggestion from CharlockHolmes should be pretty good.
PS. String#encode was introduced in Ruby 1.9. With Ruby 1.8 you can use Iconv.
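The trial-and-error described above can be sketched as a small helper that tries a list of candidate encodings against the problem bytes. The helper names and the candidate list are assumptions for illustration; a conversion that succeeds only shows the bytes are defined in that encoding, not that the encoding is the right one:

```ruby
# Hypothetical candidate list: the two encodings from the question plus cp1256.
CANDIDATE_ENCODINGS = %w[Windows-1251 Windows-1252 Windows-1256].freeze

# Returns the candidates under which every byte converts cleanly to UTF-8.
def convertible_encodings(bytes, candidates = CANDIDATE_ENCODINGS)
  candidates.select do |name|
    begin
      bytes.dup.force_encoding(name).encode('UTF-8')
      true
    rescue EncodingError
      # Covers UndefinedConversionError and InvalidByteSequenceError.
      false
    end
  end
end

# Lossy fallback: undefined or invalid bytes become "?" instead of raising.
def lossy_utf8(bytes, source_encoding)
  bytes.encode('UTF-8', source_encoding,
               invalid: :replace, undef: :replace, replace: '?')
end

problem_bytes = "\x8F\x98".b # the two bytes from the error messages
p convertible_encodings(problem_bytes)
p lossy_utf8(problem_bytes, 'CP1250')
```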

Ruby read CSV file as UTF-8 and/or convert ASCII-8Bit encoding to UTF-8

I'm using ruby 1.9.2
I'm trying to parse a CSV file that contains some French words (e.g. spécifié) and place the contents in a MySQL database.
When I read the lines from the CSV file,
file_contents = CSV.read("csvfile.csv", col_sep: "$")
The elements come back as Strings that are ASCII-8BIT encoded (spécifié becomes sp\xE9cifi\xE9), and strings like "spécifié" are then NOT properly saved into my MySQL database.
Yehuda Katz says that ASCII-8BIT is really "binary" data meaning that CSV has no idea how to read the appropriate encoding.
So, if I try to make CSV force the encoding like this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "UTF-8")
I get the following error
ArgumentError: invalid byte sequence in UTF-8:
If I go back to my original ASCII-8BIT encoded Strings and examine the String that my CSV read as ASCII-8BIT, it looks like this "Non sp\xE9cifi\xE9" instead of "Non spécifié".
I can't convert "Non sp\xE9cifi\xE9" to "Non spécifié" by doing this
"Non sp\xE9cifi\xE9".encode("UTF-8")
because I get this error:
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8,
which Katz indicated would happen because ASCII-8BIT isn't really a proper String "encoding".
Questions:
Can I get CSV to read my file in the appropriate encoding? If so, how?
How do I convert an ASCII-8BIT string to UTF-8 for proper storage in MySQL?
deceze is right, that is ISO8859-1 (AKA Latin-1) encoded text. Try this:
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")
And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:
require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first
If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:
utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)
With newer Rubies, you can do things like this:
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')
where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
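As a small worked example of that last line (the helper name is hypothetical): force_encoding only relabels the bytes without changing them, while encode actually rewrites them as UTF-8.

```ruby
# Relabel binary bytes as ISO-8859-1, then transcode them to UTF-8.
# dup keeps the caller's string untouched, since force_encoding mutates.
def latin1_bytes_to_utf8(binary_string)
  binary_string.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

raw = "Non sp\xE9cifi\xE9".b # ASCII-8BIT, as CSV returned it in the question
converted = latin1_bytes_to_utf8(raw)

puts raw.encoding
puts converted.encoding
puts converted
# Each "é" grows from one byte to two during transcoding.
puts converted.bytesize - raw.bytesize
```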
With ruby >= 1.9 you can use
file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1:utf-8")
The encoding ISO8859-1:utf-8 means: the CSV file is ISO8859-1 encoded, but its content is converted to UTF-8 on reading.
If you prefer more verbose code, you can use:
file_contents = CSV.read("csvfile.csv", col_sep: "$",
external_encoding: "ISO8859-1",
internal_encoding: "utf-8"
)
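A self-contained sketch of the external:internal encoding form, using a hypothetical two-row sample file with the same "$" separator as the question. The helper names and sample data are assumptions:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical ISO-8859-1 sample: each "é" is the single byte 0xE9.
LATIN1_CSV = "id$label\n1$Non sp\xE9cifi\xE9\n".b

# Read an ISO8859-1 file and receive every field as a UTF-8 string.
def read_latin1_csv(path)
  CSV.read(path, col_sep: '$', encoding: 'ISO8859-1:utf-8')
end

# Writes the sample to a temporary file and parses it.
def demo_read_latin1_csv
  Tempfile.create('latin1_csv') do |f|
    f.binmode
    f.write(LATIN1_CSV)
    f.flush
    read_latin1_csv(f.path)
  end
end

rows = demo_read_latin1_csv
p rows
puts rows.last.last.encoding
```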
I had been dealing with this issue for a while, and none of the other solutions worked for me.
What did the trick was to store the problematic string in a file opened in binary mode, then read the file back normally and feed the resulting string to the CSV module:
require 'csv'
require 'tempfile'
tempfile = Tempfile.new("conflictive_string")
tempfile.binmode
tempfile.write(conflictive_string)
tempfile.close
cleaned_string = File.read(tempfile.path)
File.delete(tempfile.path)
csv = CSV.new(cleaned_string)
