How to fix Encoding::UndefinedConversionError in Ruby

When trying to write a string to a file I get this message:
irb(main):011:0> IO.write("/tmp/a1", r1.body.to_s)
Encoding::UndefinedConversionError: "\xC2" from ASCII-8BIT to UTF-8
from (irb):11:in `write'
from (irb):11
irb(main):012:0>
What am I doing wrong?

I found a question like yours. Your string is in some other encoding, most likely ISO-8859-1, so you need to tag it with its actual encoding and then convert it to UTF-8:
"\xC2".force_encoding("iso-8859-1").encode("utf-8")
=> "Â"
See the original question on Stack Overflow; the answer currently at the top is the useful one.
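Applied to the original IO.write call, that looks like the sketch below, assuming the response body really is ISO-8859-1 (if it is some other encoding, substitute that instead):
# Relabel the raw bytes with their real encoding, then transcode to UTF-8:
IO.write("/tmp/a1", r1.body.to_s.force_encoding("iso-8859-1").encode("utf-8"))
# Or, if you just want the bytes on disk untouched, write in binary mode:
File.open("/tmp/a1", "wb") { |f| f.write(r1.body.to_s) }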

Related

Ruby incompatible character encodings

I am currently trying to write a script that iterates over an input file and checks data on a website. If it finds the new data, it prints to the terminal that it passes; if it doesn't, it tells me it fails, and vice versa for deleted data. It was working fine until the input file I was given contained the "™" character. When Ruby gets to that line, it spits out an error:
PDAPWeb.rb:73:in `include?': incompatible character encodings: UTF-8 and IBM437
(Encoding::CompatibilityError)
The offending line is a simple check to see if the text exists on the page.
if browser.text.include?(program_name)
Where the program_name variable is a parsed piece of information from the input file. In this instance, the program_name contains the 'TM' character mentioned before.
After some research I found that adding the line # encoding: utf-8 to the beginning of my script could help, but so far it has not proven useful.
I added this to my program_name variable to see if it would help (and it allowed my script to run without errors), but now it is not properly finding the TM character when it should be.
program_name = record[2].gsub("\n", '').force_encoding("utf-8").encode("IBM437", replace: nil)
This seemed to convert the TM character to this: Γäó
I thought maybe I had the IBM437 and UTF-8 parts reversed, so I tried the opposite:
program_name = record[2].gsub("\n", '').force_encoding("IBM437").encode("utf-8", replace: nil)
and am now receiving this error when attempting to run the script:
PDAPWeb.rb:48:in `encode': U+2122 from UTF-8 to IBM437 (Encoding::UndefinedConversionError)
I am using ruby 1.9.3p392 (2013-02-22) and I'm not sure if I should upgrade as this is the standard version installed in my company.
Is my encoding incorrect and causing it to convert the TM character with errors?
Here’s what appears to be going on. Your input file contains a ™ character, and it is in UTF-8 encoding. However, when you read it, since you don’t specify the encoding, Ruby assumes it is in your system’s default encoding of IBM437 (you must be on Windows).
This is basically the same as this:
>> input = "™"
=> "™"
>> input.encoding
=> #<Encoding:UTF-8>
>> input.force_encoding 'ibm437'
=> "\xE2\x84\xA2"
Note that force_encoding doesn’t change the actual string, just the label associated with it. This is the same outcome as in your case, only you arrive here via a different route (by reading the file).
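You can confirm that the bytes are untouched — they are still the three UTF-8 bytes of ™ even though the label now says IBM437:
>> input.bytes.to_a
=> [226, 132, 162]
>> input.encoding
=> #<Encoding:IBM437>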
The web page also has a ™ symbol, and is also encoded as UTF-8, but in this case Ruby has the encoding correct (Watir probably uses the headers from the page):
>> web_page = '™'
=> "™"
>> web_page.encoding
=> #<Encoding:UTF-8>
Now when you try to compare these two strings you get the compatibility error, because they have different encodings:
>> web_page.include? input
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and IBM437
from (irb):11:in `include?'
from (irb):11
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
If either of the two strings contained only ASCII characters (i.e. code points less than 128) then this comparison would have worked. UTF-8 and IBM437 are both supersets of ASCII, and strings in them are only incompatible if both contain characters outside the ASCII range. This is why you only started seeing this behaviour when the input file had a ™.
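For example, a comparison between differently-labelled strings still works when the strings are pure ASCII:
>> "hello world".include? "hello".force_encoding("IBM437")
=> true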
The fix is to inform Ruby what the actual encoding of the input file is. You can do this with the already loaded string:
>> input.force_encoding 'utf-8'
=> "™"
You can also do this when reading the file, e.g. (there are a few ways of reading files, they all should allow you to explicitly specify the encoding):
input = File.read("input_file.txt", :encoding => "utf-8")
# now input will be in the correct encoding
Note in both of these the string isn’t being changed, it still contains the same bytes, but Ruby now knows its correct encoding.
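The record[2] in the question suggests the file may be parsed with the CSV library; if so, CSV accepts the same option (the file name here is made up):
require 'csv'
CSV.foreach("input_file.txt", :encoding => "utf-8") do |record|
  program_name = record[2].gsub("\n", '')
  # program_name now carries the correct UTF-8 label
end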
Now the comparison should work okay:
>> web_page.include? input
=> true
There is no need to encode the string. Here’s what happens if you do. First if you correct the encoding to UTF-8 then encode to IBM437:
>> input.force_encoding("utf-8").encode("IBM437", replace: nil)
Encoding::UndefinedConversionError: U+2122 from UTF-8 to IBM437
from (irb):16:in `encode'
from (irb):16
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
IBM437 doesn’t include the ™ character, so you can’t encode a string containing it to this encoding without losing data. By default Ruby raises an exception when this happens. You can force the encoding by using the :undef option, but the symbol is lost:
>> input.force_encoding("utf-8").encode("IBM437", :undef => :replace)
=> "?"
If you go the other way, first using force_encoding to IBM437 then encoding to UTF-8 you get the string Γäó:
>> input.force_encoding("IBM437").encode("utf-8", replace: nil)
=> "Γäó"
The string is already in IBM437 encoding as far as Ruby is concerned, so force_encoding doesn’t do anything. The UTF-8 representation of ™ is the three bytes 0xe2 0x84 0xa2, and when interpreted as IBM437 these bytes correspond to the three characters seen here which are then converted into their UTF-8 representations.
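You can reproduce this byte by byte — each UTF-8 byte of ™, reinterpreted as IBM437, becomes one of those three characters:
>> "™".bytes.map { |b| b.chr.force_encoding("IBM437").encode("utf-8") }
=> ["Γ", "ä", "ó"]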
(These two outcomes are the other way round from what you describe in the question, hence my comment above. I’m assuming that this is just a copy-and-paste error.)

How to match Chinese word in Ruby?

I want to match Chinese characters in a string, but it fails:
irb(main):016:0> "身高455478".scan(/\p{Han}/)
SyntaxError: (irb):16: invalid character property name {Han}: /\p{Han}/
from C:/Program Files/Ruby-2.1.0/bin/irb.bat:18:in `<main>'
What's wrong with it?
The problem is very strange; is it a character encoding problem?
I can reproduce the problem in irb. The difference between my Ruby environment and that of others who can't reproduce it is that my irb defaults to the GBK encoding, which is used for Chinese.
This can reproduce the problem:
#encoding:GBK
p "身高455478".scan(/\p{Han}/)
shows error: invalid character property name {Han}: /\p{Han}/
To fix the problem, use the UTF-8 encoding:
#encoding:utf-8
p "身高455478".scan(/\p{Han}/)
Outputs: ["\u8EAB", "\u9AD8"]
As @Stefan suggests, to set irb to use UTF-8 encoding, start irb using irb -E UTF-8.
To encode this one string, use String#encode:
'身高455478'.encode('utf-8').scan(/\p{Han}/u)
#=> ["\u8EAB", "\u9AD8"]

ruby `encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

Hannibal episodes in tvdb have weird characters in them.
For example:
Œuf
So ruby spits out:
./manifesto.rb:19:in `encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
from ./manifesto.rb:19:in `to_json'
from ./manifesto.rb:19:in `<main>'
Line 19 is:
puts @tree.to_json
Is there a way to deal with these non-UTF-8 characters? I'd rather not replace them but convert them. Or ignore them? Any help appreciated.
The weird part is that the script works fine via cron; running it manually raises the error.
File.open(yml_file, 'w') should be changed to File.open(yml_file, 'wb').
It seems you should use another encoding for the object. Set the proper codepage on the @tree variable, for instance ISO-8859-1 instead of ASCII-8BIT, using @tree.force_encoding('ISO-8859-1'), because ASCII-8BIT is meant for plain binary data.
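A sketch of that suggestion, assuming @tree is a String that was labelled ASCII-8BIT but actually holds Latin-1 bytes (if it is a Hash, each string value inside it would need the same treatment):
require 'json'
# Relabel the bytes as Latin-1, then transcode to UTF-8 before serializing:
puts @tree.force_encoding('ISO-8859-1').encode('UTF-8').to_json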
To find the current external encoding for ruby, issue:
Encoding.default_external
If sudo solves the problem, the cause was the default codepage (encoding), so to resolve it you have to set the proper default encoding, either:
In Ruby, by changing the default external encoding to UTF-8 (or another appropriate one):
Encoding.default_external = Encoding::UTF_8
Or in bash, by first checking the currently set locale:
$ sudo env | grep UTF-8
LC_ALL=ru_RU.UTF-8
LANG=ru_RU.UTF-8
Then set them in your .bashrc in a similar way, substituting your own locale for ru_RU:
export LC_ALL=ru_RU.UTF-8
export LANG=ru_RU.UTF-8
I had the same problem when saving to the database. I'll offer one thing that I use (perhaps this will help someone).
If you know that your text sometimes contains strange characters, you can encode it into some other format before saving, and decode it again after it is returned from the database.
Example:
string = "Œuf"
Before saving, we encode the string:
text_to_save = CGI.escape(string)
=> "%C5%92uf"
(the character "Œ" is encoded as "%C5%92"; the other characters remain the same)
Load from the database and decode:
CGI.unescape("%C5%92uf")
=> "Œuf"
I just suffered through a number of hours trying to fix a similar problem. I'd checked my locales, database encoding, everything I could think of and was still getting ASCII-8BIT encoded data from the database.
Well, it turns out that if you store text in a binary field, it will automatically be returned as ASCII-8BIT encoded text. That makes sense, but it can (obviously) cause problems in your application.
It can be fixed by changing the column type back to :text in your migrations.
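For example, a migration along these lines (table and column names are hypothetical):
class ChangePayloadToText < ActiveRecord::Migration
  def up
    change_column :documents, :payload, :text
  end

  def down
    change_column :documents, :payload, :binary
  end
end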

gsub encoding issues with UTF-8

I am trying to create a slug from some usernames in a DB migration.
nick = nick.gsub('á','a')
I also want to change éíóúñ to eioun in the same way.
Doing so doesn't work; I get:
incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) (Encoding::CompatibilityError)
But whatever I do, for example adding a force_encoding call, I always get encoding errors like:
invalid byte sequence in UTF-8 (ArgumentError)
"\xF3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
This only happens when I have a gsub changing those vowels or the Spanish letter ñ.
There's also an encoding: utf-8 line in my file, and the data comes from a UTF-8 database, but nothing seems to help.
I've seen some similar questions on SO, but nothing I try fixes it.
By the way, this is not Rails related.
I finally used transliterate from Rails ActiveSupport:
require 'active_support/all'
v = ActiveSupport::Inflector.transliterate v.downcase
v.gsub(/[^a-z1-9]+/, '-').chomp('-')
Works fine.
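Wrapped up as a small helper (the method name is mine). Two tweaks to note: I transliterate before downcasing, because String#downcase on Rubies before 2.4 leaves accented capitals like Á untouched; and the original pattern [^a-z1-9] also strips the digit 0, so I use [^a-z0-9] here:
require 'active_support/all'

def slugify(nick)
  # Replace accented characters with ASCII approximations, then lowercase.
  v = ActiveSupport::Inflector.transliterate(nick).downcase
  # Collapse every run of other characters into a single hyphen.
  v.gsub(/[^a-z0-9]+/, '-').chomp('-')
end

slugify('Ángel Muñoz')  #=> "angel-munoz"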

How to change deprecated iconv to String#encode for invalid UTF8 correction

I get sources from the web and sometimes the encoding of the material is not 100% valid UTF-8. I use iconv to silently ignore these sequences and get a cleaned string:
@iconv = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = @iconv.iconv(untrusted_string)
However, now that iconv has been deprecated, I see its deprecation warning a lot:
iconv will be deprecated in the future, use String#encode
I tried converting it using String#encode's :invalid and :replace options, but that doesn't seem to work (i.e. the incorrect byte sequence is not removed). What is the correct way to use String#encode for this?
This has been answered in this question:
Is there a way in ruby 1.9 to remove invalid byte sequences from strings?
Use either
untrusted_string.chars.select{|i| i.valid_encoding?}.join
or
untrusted_string.encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
The question that Martijn linked to has what seem to be the two best ways to do that, but Martijn made an understandable but incorrect change when copying the second approach to his answer here. Doing .encode('UTF-8', <options>).encode('UTF-8') doesn't work. As indicated in the original answer in the other question, the key is to encode to a different encoding, then back to UTF-8. If your original string is already flagged as UTF-8 in ruby's internals then ruby will ignore any call to encode it as UTF-8.
In the following examples I'm going to use "a#{0xFF.chr}b".force_encoding('UTF-8') to produce a string that ruby believes is UTF-8 but which contains invalid UTF-8 bytes.
1.9.3p194 :019 > "a#{0xFF.chr}b".force_encoding('UTF-8')
=> "a\xFFb"
1.9.3p194 :020 > "#{0xFF.chr}".force_encoding('UTF-8').encoding
=> #<Encoding:UTF-8>
Note how encoding to UTF-8 does nothing:
1.9.3p194 :016 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
=> "a\xFFb"
But encoding to something else (UTF-16) and then back to UTF-8 cleans up the string:
1.9.3p194 :017 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
=> "ab"
