I'm testing 2 situations and getting 2 strangely different results.
First:
hash_data_file = CSV.parse(data_file).map { |line|
  puts line[6]
  abort
}
This prints Caixa Econômica Federal, with the accents in the right place.
Second:
hash_data_file = CSV.parse(data_file).map { |line|
  puts :bank => line[6]
  abort
}
But this prints {:bank=>"Caixa Econ\xC3\xB4mica Federal"}, a string with escaped bytes where the accented character should be.
What am I doing wrong?
In the first case, your data_file is in UTF-8 encoding. In the second case, data_file has binary encoding (i.e. ASCII-8BIT, raw bytes with no character encoding attached).
For example, if we start with a simple UTF-8 CSV file:
bank
Caixa Econômica Federal
and then parse it with UTF-8 encoding:
CSV.parse(File.open('pancakes.csv', encoding: 'utf-8'))
# [["bank"], ["Caixa Econômica Federal"]]
and then in binary encoding:
CSV.parse(File.open('pancakes.csv', encoding: 'binary'))
# [["bank"], ["Caixa Econ\xC3\xB4mica Federal"]]
So you need to fix the encoding by reading the file in the proper encoding. Hard to say more since we don't know how data_file is being opened.
Have a look at
line[6].encoding
and you should see #<Encoding:UTF-8> in the first case but #<Encoding:ASCII-8BIT> in the second.
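A small sketch of why the two outputs can differ even for identical bytes: puts writes a string's bytes as they are, while printing a hash goes through inspect, which escapes high bytes when the string is tagged as binary. The variable names and the hard-coded sample string are made up for illustration.

```ruby
# The same bytes, tagged two different ways.
raw   = "Caixa Econ\xC3\xB4mica Federal".dup.force_encoding('binary')
fixed = raw.dup.force_encoding('UTF-8')

p raw.encoding             # the binary (ASCII-8BIT) encoding
p fixed.encoding           # UTF-8
puts raw.inspect           # high bytes appear as \xC3\xB4 escapes
puts fixed                 # same bytes; a UTF-8 terminal renders the accent
p raw.bytes == fixed.bytes # true: only the encoding tag differs
p raw == fixed             # false: same bytes, incompatible encodings
```

Note the last line: comparing a binary-tagged string with a UTF-8 one returns false once non-ASCII bytes are involved, so the tag matters for more than display.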
There is no “error in codification.”
"Caixa Econ\xC3\xB4mica Federal" == "Caixa Econômica Federal"
#⇒ true
For some reason, when printing out a hash, Ruby uses this escaped representation (I cannot reproduce it, though). In a nutshell, the string you see is fine.
I have a string such as:
"MÃ\u0083¼LLER".encoding
#<Encoding:UTF-8>
"MÃ\u0083¼LLER".inspect
"\"MÃ\\u0083¼LLER\""
What can I do to salvage such a string? Take into consideration I do not have the original data. Is this salvageable?
It looks like the string was run through a Latin-1 to UTF-8 conversion twice, so converting it from UTF-8 back to Latin-1 twice should undo the damage. Try this on some of your data and let me know if it works:
require 'iconv'
def decode(str)
  i = Iconv.new('LATIN1', 'UTF-8')
  i.iconv(i.iconv(str)).force_encoding('UTF-8')
end
decode("MÃ\u0083¼LLER")
#=> "MüLLER"
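Iconv was removed from the standard library in Ruby 2.0, so on newer Rubies the same idea can be sketched with String#encode. The helper name undo_double_encoding and its rounds parameter are hypothetical, and the sample below builds its own doubly mangled string rather than reusing the one from the question.

```ruby
# Hypothetical helper: undo `rounds` passes of "UTF-8 bytes misread as Latin-1".
def undo_double_encoding(str, rounds = 2)
  rounds.times do
    # Map each character back to the single byte it came from,
    # then reinterpret the resulting bytes as UTF-8.
    str = str.encode('ISO-8859-1').force_encoding('UTF-8')
  end
  str
end

# Build a doubly mangled sample so the round trip can be checked.
original = "M\u00FCLLER"
mangled  = original
2.times { mangled = mangled.dup.force_encoding('ISO-8859-1').encode('UTF-8') }

puts undo_double_encoding(mangled)
p undo_double_encoding(mangled) == original
```

If the input contains characters outside Latin-1, encode raises Encoding::UndefinedConversionError, which is a useful signal that the string was not damaged in this particular way.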
I am using URI.unescape to unescape text, but I run into a weird error:
# encoding: utf-8
require('uri')
URI.unescape("%C3%9Fą")
results in
C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `gsub': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `unescape'
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:649:in `unescape'
from exe/fail.rb:3:in `<main>'
why?
I don't know why, but you can use the CGI.unescape method instead:
# encoding: utf-8
require 'cgi'
CGI.unescape("%C3%9Fą")
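A quick check of that suggestion, as a sketch: it decodes the same kind of mixed input with CGI.unescape and, for comparison, with URI.decode_www_form_component, which the Ruby docs name as another replacement. The input string and variable names are illustrative only.

```ruby
require 'cgi'
require 'uri'

# Percent-escapes followed by a raw non-ASCII character.
escaped = "%C3%9F\u0105"

cgi_result = CGI.unescape(escaped)
uri_result = URI.decode_www_form_component(escaped)

p cgi_result == "\u00DF\u0105"
p uri_result == "\u00DF\u0105"

# Both follow form-encoding rules, so "+" decodes to a space.
p CGI.unescape("a+b")
p URI.decode_www_form_component("a+b")
```

The "+" handling is the main behavioural difference to keep in mind when swapping these in for URI.unescape.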
The implementation of URI.unescape is broken for non-ASCII inputs. The 1.9.3 version looks like this:
def unescape(str, escaped = @regexp[:ESCAPED])
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(str.encoding)
end
The regex in use is /%[a-fA-F\d]{2}/. So it goes through the string looking for a percent sign followed by two hex digits; in the block, $& will be the matched text ('%C3' for example) and $&[1,2] will be the matched text without the leading percent sign ('C3'). Then we call String#hex to convert that hexadecimal number to a Fixnum (195) and wrap it in an Array ([195]) so that we can use Array#pack to do the byte mangling for us. The problem is that pack gives us a single binary byte:
> puts [195].pack('C').encoding
ASCII-8BIT
The ASCII-8BIT encoding is also known as "binary" (i.e. plain bytes with no particular encoding). Then the block returns that byte, String#gsub tries to insert it into the UTF-8 encoded copy of str that gsub is working on, and you get your error:
incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
because you can't (in general) just stuff binary bytes into a UTF-8 string; you can often get away with it:
URI.unescape("%C3%9F") # Works
URI.unescape("%C3µ") # Fails
URI.unescape("µ") # Works, but nothing to gsub here
URI.unescape("%C3%9Fµ") # Fails
URI.unescape("%C3%9Fpancakes") # Works
Things start falling apart once you start mixing non-ASCII data into your URL encoded string.
One simple fix is to switch the string to binary before trying to decode it:
def unescape(str, escaped = @regexp[:ESCAPED])
  encoding = str.encoding
  str = str.dup.force_encoding('binary')
  str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
end
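The binary-first idea can be tried outside of URI as a standalone method. This is only a sketch: unescape_bytes and the ESCAPED constant are hypothetical names, and the pattern is the one quoted above rather than a reference to URI's internals.

```ruby
# Same pattern the answer quotes for URI's :ESCAPED regexp.
ESCAPED = /%[a-fA-F\d]{2}/

# Hypothetical standalone version of the "go binary first" fix.
def unescape_bytes(str, escaped = ESCAPED)
  encoding = str.encoding
  # Work on a binary copy so the packed bytes are always compatible.
  binary  = str.dup.force_encoding('binary')
  decoded = binary.gsub(escaped) { [$&[1, 2].hex].pack('C') }
  # Re-tag the result with the caller's original encoding.
  decoded.force_encoding(encoding)
end

result = unescape_bytes("%C3%9F\u0105")
p result.valid_encoding?      # true
p result == "\u00DF\u0105"    # true
```

Because the caller's string is duplicated before force_encoding, the argument itself is left untouched.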
Another option is to push the force_encoding into the block:
def unescape(str, escaped = @regexp[:ESCAPED])
  str.gsub(escaped) { [$&[1, 2].hex].pack('C').force_encoding(str.encoding) }
end
I'm not sure why the gsub fails in some cases but succeeds in others.
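One plausible explanation, offered as a sketch rather than a reading of gsub's internals: Ruby lets an ASCII-only string be combined with a string in any ASCII-compatible encoding, but refuses to mix two strings that both contain non-ASCII bytes under different encodings. Encoding.compatible? makes the rule visible; the sample strings are illustrative.

```ruby
byte = [195].pack('C')          # one binary (ASCII-8BIT) byte, 0xC3

ascii_only = "%C3%9Fpancakes"   # UTF-8 string, ASCII characters only
with_micro = "%C3\u00B5"        # UTF-8 string containing a non-ASCII character

# Returns an Encoding when the two strings can be combined, nil when they cannot.
p Encoding.compatible?(ascii_only, byte)
p Encoding.compatible?(with_micro, byte)   # nil

combined = ascii_only + byte    # allowed: one side is pure ASCII
p combined.encoding

begin
  with_micro + byte             # both sides carry non-ASCII bytes
rescue Encoding::CompatibilityError => e
  puts e.message
end
```

This matches the table above: the inputs that fail are exactly the ones containing a literal non-ASCII character next to a percent-escape.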
To expand on Vasiliy's answer that suggests using CGI.unescape:
As of Ruby 2.5.0, URI.unescape is obsolete (and it was removed entirely in Ruby 3.0).
See https://ruby-doc.org/stdlib-2.5.0/libdoc/uri/rdoc/URI/Escape.html#method-i-unescape.
"This method is obsolete and should not be used. Instead, use CGI.unescape, URI.decode_www_form or URI.decode_www_form_component depending on your specific use case."
I have a string I have read from some kind of input.
To the best of my knowledge, it is UTF8. Okay:
string.force_encoding("UTF-8")
But if this string has bytes in it that are not in fact legal UTF8, I want to know now and take action.
Ordinarily, will force_encoding("UTF-8") raise if it encounters such bytes? I believe it will not.
If I was doing an #encode I could choose from the handy options with what to do with characters that are invalid in the source encoding (or destination encoding).
But I'm not doing an #encode, I'm doing a #force_encoding. It has no such options.
Would it make sense to
string.force_encoding("UTF-8").encode("UTF-8")
to get an exception right away? Normally encoding from utf8 to utf8 doesn't make any sense. But maybe this is the way to get it to raise right away if there's invalid bytes? Or use the :replace option etc to do something different with invalid bytes?
But no, can't seem to make that work either.
Anyone know?
1.9.3-p0 :032 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
=> "bad: \xC3( okay"
1.9.3-p0 :033 > a.valid_encoding?
=> false
Okay, but how do I find and eliminate those bad bytes? Oddly, this does NOT raise:
1.9.3-p0 :035 > a.encode("utf-8")
=> "bad: \xC3( okay"
If I was converting to a different encoding, it would!
1.9.3-p0 :039 > a.encode("ISO-8859-1")
Encoding::InvalidByteSequenceError: "\xC3" followed by "(" on UTF-8
Or if I told it to, it'd replace it with a "?" =>
1.9.3-p0 :040 > a.encode("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"
So ruby's got the smarts to know what are bad bytes in utf-8, and to replace em with something else -- when converting to a different encoding. But I don't want to convert to a different encoding, i want to stay utf8 -- but I might want to raise if there's an invalid byte in there, or I might want to replace invalid bytes with replacement chars.
Isn't there some way to get ruby to do this?
Update: I believe this has finally been added to Ruby in 2.1, with String#scrub present in the 2.1 preview release to do this. So look for that!
(Update: see also https://github.com/jrochkind/scrub_rb)
So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb
But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":
a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"
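Yep, that's exactly what I wanted. So it turns out this IS built into the 1.9 stdlib; it's just undocumented and few people know it (or maybe few people that speak English know it?). I did see these arguments used this way on a blog somewhere, so someone else knew it! One caveat: with 'binary' as the source encoding, every byte above 0x7F counts as undefined, so valid multi-byte UTF-8 characters are replaced as well. The trick is only safe when everything except the bad bytes is plain ASCII.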
In ruby 2.1, the stdlib finally supports this with scrub.
http://ruby-doc.org/core-2.1.0/String.html#method-i-scrub
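A short sketch of String#scrub on the same sample string (Ruby 2.1 or later). It stays in UTF-8 and touches only the invalid bytes; the replacement strings chosen here are arbitrary.

```ruby
a = "bad: \xc3\x28 okay".dup.force_encoding('UTF-8')

p a.valid_encoding?        # false
p a.scrub                  # bad byte replaced with U+FFFD
p a.scrub('?')             # "bad: ?( okay"

# A block receives the offending bytes, so they can be logged or escaped.
p a.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }   # "bad: <c3>( okay"

# Valid multi-byte characters are left alone.
p "caf\u00E9 \xC3(".scrub('?') == "caf\u00E9 ?("        # true
```

To raise instead of repairing, checking valid_encoding? first and raising yourself is the simplest option.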
Make sure that your script file itself is saved as UTF-8 and try the following:
# encoding: UTF-8
p [a = "bad: \xc3\x28 okay", a.valid_encoding?]
p [a.force_encoding("utf-8"), a.valid_encoding?]
p [a.encode!("ISO-8859-1", :invalid => :replace), a.valid_encoding?]
On my Windows 7 system this gives the following:
["bad: \xC3( okay", false]
["bad: \xC3( okay", false]
["bad: ?( okay", true]
So your bad char is replaced. You can do it right away as follows (note that the result is then in ISO-8859-1, not UTF-8):
a = "bad: \xc3\x28 okay".encode!("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"
EDIT: here is a solution that works on any arbitrary encoding. The first version re-encodes only the bad chars, the second just replaces them with a ?
def validate_encoding(str)
  str.chars.collect do |c|
    c.valid_encoding? ? c : c.encode!(Encoding.locale_charmap, :invalid => :replace)
  end.join
end
def validate_encoding2(str)
  str.chars.collect do |c|
    c.valid_encoding? ? c : '?'
  end.join
end
a = "bad: \xc3\x28 okay"
puts validate_encoding(a) #=>bad: ?( okay
puts validate_encoding(a).valid_encoding? #=>true
puts validate_encoding2(a) #=>bad: ?( okay
puts validate_encoding2(a).valid_encoding? #=>true
To check that a string has no invalid sequences, try to convert it to the binary encoding:
# Returns true if the string has only valid sequences
def valid_encoding?(string)
  string.encode('binary', :undef => :replace)
  true
rescue Encoding::InvalidByteSequenceError
  false
end
p valid_encoding?("\xc0".force_encoding('iso-8859-1')) # true
p valid_encoding?("\u1111") # true
p valid_encoding?("\xc0".force_encoding('utf-8')) # false
This code replaces undefined characters, because we don't care if there are valid sequences that cannot be represented in binary. We only care if there are invalid sequences.
A slight modification to this code returns the actual error, which has valuable information about the improper encoding:
# Returns the encoding error, or nil if there isn't one.
def encoding_error(string)
  string.encode('binary', :undef => :replace)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.to_s
end
# Returns truthy if the string has only valid sequences
def valid_encoding?(string)
  !encoding_error(string)
end
puts encoding_error("\xc0".force_encoding('iso-8859-1')) # nil
puts encoding_error("\u1111") # nil
puts encoding_error("\xc0".force_encoding('utf-8')) # "\xC0" on UTF-8
About the only thing I can think of is to transcode to something and back that won't damage the string in the round-trip:
string.force_encoding("UTF-8").encode("UTF-32LE").encode("UTF-8")
Seems rather wasteful, though.
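A sketch of that round trip on the sample string from the question. Without options the first hop raises on the invalid byte; with :invalid => :replace it swaps in a replacement character instead. The variable names are illustrative.

```ruby
a = "bad: \xc3\x28 okay".dup.force_encoding('UTF-8')

# Without options, transcoding raises on the invalid byte.
begin
  a.encode('UTF-32LE').encode('UTF-8')
rescue Encoding::InvalidByteSequenceError => e
  puts "raised: #{e.message}"
end

# With :invalid => :replace, the bad byte becomes a replacement character
# and the result comes back as valid UTF-8.
cleaned = a.encode('UTF-32LE', :invalid => :replace).encode('UTF-8')
p cleaned.valid_encoding?   # true
p cleaned
```

The cost is two full transcoding passes over the string, which is the waste the answer mentions.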
Okay, here's a really lame pure ruby way to do it I figured out myself. It probably performs for crap. what the heck, ruby? Not selecting my own answer for now, hoping someone else will show up and give us something better.
# Pass in a string, will raise an Encoding::InvalidByteSequenceError
# if it contains an invalid byte for its encoding; otherwise
# returns an equivalent string.
#
# OR, like String#encode, pass in option `:invalid => :replace`
# to replace invalid bytes with a replacement string in the
# returned string. Pass in the
# char you'd like with option `:replace`, or will, like String#encode
# use the unicode replacement char if it thinks it's a unicode encoding,
# else ascii '?'.
#
# in any case, method will raise, or return a new string
# that is #valid_encoding?
def validate_encoding(str, options = {})
  str.chars.collect do |c|
    if c.valid_encoding?
      c
    else
      unless options[:invalid] == :replace
        # it ought to be filled out with all the metadata
        # this exception usually has, but what a pain!
        raise Encoding::InvalidByteSequenceError.new
      else
        options[:replace] || (
          # surely there's a better way to tell if
          # an encoding is a 'Unicode encoding form'
          # than this? What's wrong with you ruby 1.9?
          str.encoding.name.start_with?('UTF') ?
            "\uFFFD" :
            "?" )
      end
    end
  end.join
end
More ranting at http://bibwild.wordpress.com/2012/04/17/checkingfixing-bad-bytes-in-ruby-1-9-char-encoding/
If you are doing this for a "real-life" use case, for example parsing different strings entered by users, and not just for the sake of being able to "decode" a totally random file which could be made of as many encodings as you wish, then I guess you could at least assume that all characters in each string have the same encoding.
Then, in this case, what would you think about this?
strings = [ "UTF-8 string with some utf8 chars \xC3\xB2 \xC3\x93",
            "ISO-8859-1 string with some iso-8859-1 chars \xE0 \xE8", "..." ]
strings.each { |s|
  s.force_encoding "utf-8"
  if s.valid_encoding?
    next
  else
    while s.valid_encoding? == false
      s.force_encoding "ISO-8859-1"
      s.force_encoding "..."
    end
    s.encode!("utf-8")
  end
}
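The same idea can be sketched as a method that walks an ordered list of candidate encodings and keeps the first one under which the bytes are valid. The to_utf8 name and the CANDIDATES list are assumptions for illustration; the list would need to match whatever encodings your inputs can actually be in.

```ruby
# Hypothetical candidate list, tried in order. Single-byte encodings such as
# ISO-8859-1 accept any byte sequence, so they only make sense as a last resort.
CANDIDATES = ['UTF-8', 'ISO-8859-1'].freeze

# Returns a UTF-8 copy of `bytes`, or nil if no candidate fits.
def to_utf8(bytes, candidates = CANDIDATES)
  candidates.each do |name|
    attempt = bytes.dup.force_encoding(name)
    return attempt.encode('UTF-8') if attempt.valid_encoding?
  end
  nil
end

p to_utf8("utf8 chars \xC3\xB2 \xC3\x93") == "utf8 chars \u00F2 \u00D3"
p to_utf8("latin1 chars \xE0 \xE8")       == "latin1 chars \u00E0 \u00E8"
```

Order matters: putting ISO-8859-1 first would make every input "match" it and UTF-8 would never be tried.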
I am not a Ruby "pro" in any way, so please forgive if my solution is wrong or even a bit naive..
I just try to give back what I can, and this is what I've come to, while I was (I still am) working on this little parser for arbitrarily encoded strings, which I am doing for a study-project.
While I'm posting this, I must admit that I've not even fully tested it.. I.. just got a couple of "positive" results, but I felt so excited of possibly having found what I was struggling to find (and for all the time I spent reading about this on SO..) that I just felt the need to share it as quick as possible, hoping that it could help save some time to anyone who has been looking for this for as long as I've been... .. if it works as expected :)
A simple way to provoke an exception seems to be:
untrusted_string.match(/./)
Here are 2 common situations and how to deal with them in Ruby 2.1+. I know, the question refers to Ruby v1.9, but maybe this is helpful for others finding this question via Google.
Situation 1
You have a UTF-8 string with possibly a few invalid bytes
Remove the invalid bytes:
str = "Partly valid\xE4 UTF-8 encoding: äöüß"
str.scrub('')
# => "Partly valid UTF-8 encoding: äöüß"
Situation 2
You have a string that could be in either UTF-8 or ISO-8859-1 encoding
Check which encoding it is and convert to UTF-8 (if necessary):
str = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
unless str.valid_encoding?
  str.encode!( 'UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '?' )
end #unless
# => "String in ISO-8859-1 encoding: äöüß"
Notes
The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Even though this is almost always the case, you can make sure of it by starting your scripts with # encoding: UTF-8.
It is programmatically possible to detect that a string is invalid in most multi-byte encodings like UTF-8 (in Ruby, see #valid_encoding?). However, it is NOT (easily) possible to detect invalidity in single-byte encodings like ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. it cannot detect whether a String is valid ISO-8859-1.
Even though UTF-8 has become increasingly popular as the default encoding on the web, ISO-8859-1 and other Latin-1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to ISO-8859-1 but differ slightly from it. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15
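The last two notes can be illustrated with a small sketch: every byte value passes valid_encoding? under ISO-8859-1, and the same byte can decode to different characters under its near relatives. The byte values picked here are just convenient examples.

```ruby
# Every byte value is "valid" ISO-8859-1, so validity cannot identify it.
all_valid = (0..255).all? do |b|
  [b].pack('C').force_encoding('ISO-8859-1').valid_encoding?
end
p all_valid   # true

# The same byte decodes differently under similar single-byte encodings.
euro     = "\u20AC"   # euro sign
currency = "\u00A4"   # generic currency sign
p "\x80".dup.force_encoding('Windows-1252').encode('UTF-8') == euro      # true
p "\xA4".dup.force_encoding('ISO-8859-15').encode('UTF-8')  == euro      # true
p "\xA4".dup.force_encoding('ISO-8859-1').encode('UTF-8')   == currency  # true
```

So when the source might be any of these, guessing wrong produces no error at all, only the wrong character.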
I have just recently upgraded to Ruby 1.9.2 and one of my monkey patches is failing with some sort of encoding error. I have the following method:
def strip_noise()
  return if (!self) || (self.size == 0)
  self.delete(160.chr+194.chr).gsub(/[,]/, "").strip
end
That now gives me the following error:
incompatible character encodings: UTF-8 and ASCII-8BIT
Has anyone else come across this?
This is working for me at the moment anyway:
class String
  def strip_noise()
    return if empty?
    self.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'')
  end
end
I need to do more testing but I can progress..
class String
  def strip_noise
    return if empty?
    ActiveSupport::Inflector.transliterate self, ''
  end
end
"#{160.chr}#{197.chr} string with noises" # => "\xA0\xC5 string with noises"
"#{160.chr}#{197.chr} string with noises".strip_noise # => "A string with noises"
This might not be exactly what you want:
def strip_noise
  return if empty?
  sub = 160.chr.force_encoding(encoding) + 194.chr.force_encoding(encoding)
  delete(sub).gsub(/[,]/, "").strip
end
Read more on the topic here: http://yehudakatz.com/2010/05/17/encodings-unabridged/
It's not entirely clear what you're trying to do here, but 160.chr+194.chr is not valid UTF-8: 160 is a continuation byte, and 194 is the first byte of a 2-byte character. Reversed they form the unicode character for "non breaking space".
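The byte-order point is easy to check, and it suggests a UTF-8-aware rewrite that deletes the non-breaking space as a character instead of as two separate bytes. This is a sketch: the standalone strip_noise(str) below is a hypothetical variant, not the original monkey patch.

```ruby
nbsp     = [194, 160].pack('C*').force_encoding('UTF-8')
reversed = [160, 194].pack('C*').force_encoding('UTF-8')

p nbsp == "\u00A0"           # true: 0xC2 0xA0 is U+00A0
p nbsp.valid_encoding?       # true
p reversed.valid_encoding?   # false: a continuation byte cannot come first

# Hypothetical UTF-8-aware variant: remove non-breaking spaces and commas,
# then trim ordinary whitespace.
def strip_noise(str)
  str.delete("\u00A0,").strip
end

p strip_noise("\u00A0 1,234\u00A0")   # "1234"
```

Working with characters avoids mixing binary and UTF-8 strings, which is what triggers the compatibility error.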
If you want to remove all non-ASCII characters (anything outside the 7-bit range), try this:
s.delete!("^\u{0000}-\u{007F}")