Delete non-UTF characters from a string in Ruby? - ruby

How do I delete non-UTF-8 characters from a Ruby string? I have a string that contains, for example, "\xC2". I want to remove that byte from the string so that it becomes valid UTF-8.
This:
text.gsub!(/\xC2/, '')
returns an error:
incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
I was looking at text.unpack('U*') and string.pack as well, but did not get anywhere.

You can use encode for that.
text.encode('UTF-8', :invalid => :replace, :undef => :replace)
For more info look into Ruby-Docs
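On Ruby 2.1 and later, String#scrub is another way to get the same effect. The following is a minimal sketch, assuming a UTF-8 string with a stray \xC2 byte; the sample text is only an illustration. Passing '' as the replacement deletes invalid bytes instead of substituting U+FFFD.

```ruby
# Sample input: a lone \xC2 lead byte is invalid UTF-8 on its own.
# .b gives a fresh binary copy; force_encoding then tags it as UTF-8,
# so the example does not depend on the source file's encoding.
text = "testing\xC2 a non UTF-8 string".b.force_encoding('UTF-8')

replaced = text.scrub        # invalid bytes become U+FFFD
cleaned  = text.scrub('')    # invalid bytes are dropped

puts text.valid_encoding?    # false
puts cleaned.valid_encoding? # true
puts cleaned
```

Unlike gsub! with a byte-level pattern, scrub only touches bytes that do not form valid characters.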

You could do it like this
# encoding: utf-8
class String
def validate_encoding
chars.select(&:valid_encoding?).join
end
end
puts "testing\xC2 a non UTF-8 string".validate_encoding
#=>testing a non UTF-8 string

Your text has ASCII-8BIT encoding. Instead, you could use this:
text.delete!("^\u{0000}-\u{007F}")
It serves the same purpose, but note that it removes every character outside the ASCII range, not only the invalid bytes.

You can use /n, as in
text.gsub!(/\xC2/n, '')
to force the Regexp to operate on bytes.
Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
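To see why stripping every \xC2 byte is risky, here is a small sketch using U+00A9 (the copyright sign) as an arbitrary example from that range. Removing the \xC2 byte leaves a lone continuation byte, so a valid string becomes invalid.

```ruby
copyright = "\u00A9"                      # a valid UTF-8 string
p copyright.bytes.map { |b| b.to_s(16) }  # hex of its two UTF-8 bytes

# Drop every 0xC2 byte, which is what the byte-level regexp does.
remaining = copyright.bytes.reject { |b| b == 0xC2 }
stripped  = remaining.pack('C*').force_encoding('UTF-8')

puts copyright.valid_encoding?  # true
puts stripped.valid_encoding?   # false: only the continuation byte 0xA9 is left
```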

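Try Iconv (part of the standard library up to Ruby 1.9.3; from Ruby 2.0 on it is only available as the iconv gem):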
1.9.3p194 :001 > require 'iconv'
# => true
1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
# => "testing\xC2 a non UTF-8 string"
1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# => #<Iconv:0x000000026c9290>
1.9.3p194 :004 > ic.iconv string
# => "testing a non UTF-8 string"

The best solution to this problem that I've found is this answer to the same question: https://stackoverflow.com/a/8711118/363293.
In short: "€foo\xA0".chars.select(&:valid_encoding?).join

data = '' if not (data.force_encoding("UTF-8").valid_encoding?)

Related

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

I am trying to convert a string from ISO-8859-1 encoding to UTF-8 but I can't seem to get it work. Here is an example of what I have done in irb.
irb(main):050:0> string = 'Norrlandsvägen'
=> "Norrlandsvägen"
irb(main):051:0> string.force_encoding('iso-8859-1')
=> "Norrlandsv\xC3\xA4gen"
irb(main):052:0> string = string.encode('utf-8')
=> "NorrlandsvÃ¤gen"
I am not sure why Norrlandsvägen in iso-8859-1 is converted into NorrlandsvÃ¤gen in utf-8.
I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and all kinds of weird work-arounds I could think of but nothing seems to work. Can someone please help me/point me in the right direction?
Ruby newbie still pulling hair like crazy but feeling grateful for all the replies here... :)
Background of this question: I am writing a gem that will download an xml file from some websites (which will have iso-8859-1 encoding) and save it in a storage and I would like to convert it to utf-8 first. But words like Norrlandsvägen keep messing me up. Really any help would be greatly appreciated!
[UPDATE]: I realized running tests like this in the irb console might give me different behaviors so here is what I have in my actual code:
def convert_encoding(string, originalEncoding)
puts "#{string.encoding}" # ASCII-8BIT
string.encode(originalEncoding)
puts "#{string.encoding}" # still ASCII-8BIT
string.encode!('utf-8')
end
but the last line gives me the following error:
Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8
Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:
irb(main):001:0> string = 'ä'
=> "ä"
irb(main):002:0> string.force_encoding('iso-8859-1')
=> "\xC3\xA4"
I have also tried to assign a new variable to the result of string.encode(originalEncoding) but got an even weirder error:
newString = string.encode(originalEncoding)
puts "#{newString.encoding}" # can't even get to this line...
newString.encode!('utf-8')
and the error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1
I am still quite lost in all of this encoding mess but I am really grateful for all the replies and help everyone has given me! Thanks a ton! :)
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
EDIT For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
:use_ssl => uri.scheme == 'https',
:verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
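The same idea without the network call, as a minimal sketch: raw stands in for response.body (the question reports its string as ASCII-8BIT), and convert_encoding is a reworked version of the question's helper. It assumes you already know the real source encoding.

```ruby
# Hypothetical stand-in for response.body: ISO-8859-1 bytes tagged as binary.
raw = "Norrlandsv\xE4gen".b

# Relabel the bytes with their real encoding, then transcode to UTF-8.
# dup keeps the caller's string untouched, since force_encoding mutates.
def convert_encoding(string, original_encoding)
  string.dup.force_encoding(original_encoding).encode('UTF-8')
end

converted = convert_encoding(raw, 'ISO-8859-1')
puts converted.encoding                 # UTF-8
puts converted
p raw.encoding == Encoding::BINARY      # true: the input is left as it was
```

The question's version failed because encode was asked to transcode from ASCII-8BIT, which has no mapping for bytes above 0x7F; force_encoding supplies the missing information first.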
There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding. Consequently, the following code causes your problem:
string = "Norrlandsvägen"
string.force_encoding('iso-8859-1')
puts string.encode('utf-8') # NorrlandsvÃ¤gen
Whereas the following code will actually correctly encode your contents:
string = "Norrlandsvägen".encode('iso-8859-1')
string.encode!('utf-8')
Here's an example running in irb:
irb(main):023:0> string = "Norrlandsvägen".encode('iso-8859-1')
=> "Norrlandsv\xE4gen"
irb(main):024:0> string.encoding
=> #<Encoding:ISO-8859-1>
irb(main):025:0> string.encode!('utf-8')
=> "Norrlandsvägen"
irb(main):026:0> string.encoding
=> #<Encoding:UTF-8>
The above answer was spot on. Specifically this point here:
There's a difference between force_encoding and encode. The former
sets the encoding for the string, whereas the latter actually
transcodes the contents of the string to the new encoding.
In my situation, I had a text file with iso-8859-1 encoding. By default, Ruby uses UTF-8 encoding, so if you were to try to read the file without specifying the encoding, then you would get an error:
results = File.read(file)
results.encoding
=> #<Encoding:UTF-8>
results.split("\r\n")
ArgumentError: invalid byte sequence in UTF-8
You get an invalid byte sequence error because the file's bytes are being interpreted as UTF-8, and a byte that is a complete character in ISO-8859-1 (for example \xE4 for ä) is not a valid UTF-8 sequence on its own. Consequently, you need to specify the encoding to the File API. Think of it like force_encoding:
results = File.read(file, encoding: "iso-8859-1")
So everything is good, right? No, not if you want to start parsing the ISO-8859-1 string using UTF-8 string literals:
results = File.read(file, encoding: "iso-8859-1")
results.each_line do |line|
puts line.split('¬')
end
Encoding::CompatibilityError: incompatible character encodings: ISO-8859-1 and UTF-8
Why this error? Because '¬' is a UTF-8 string literal, and you are using it against an ISO-8859-1 string; the two encodings are incompatible. So, after you read the file as ISO-8859-1, ask Ruby to encode it into UTF-8. From then on you will be working with UTF-8 strings, and thus have no problems:
results = File.read(file, encoding: "iso-8859-1").encode('UTF-8')
results.encoding
results = results.split("\r\n")
results.each do |line|
puts line.split('¬')
end
Ultimately, with some Ruby APIs, you do not need to use force_encoding('ISO-8859-1'). Instead, you just specify the expected encoding to the API. However, you must convert it back to UTF-8 if you plan to parse it with UTF-8 strings.
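As a sketch of an alternative, File.read also accepts an 'external:internal' encoding pair, which reads the file as ISO-8859-1 and transcodes it to UTF-8 in one step. The temporary file and its contents below are made up for the demonstration, and "\u00AC" is the same character as the '¬' separator used above.

```ruby
require 'tempfile'

fields = Tempfile.create('latin1-demo') do |f|
  f.binmode
  # ISO-8859-1 bytes: \xE4 is a-umlaut, \xAC is the not-sign separator.
  f.write("Norrlandsv\xE4gen\xACSE\r\n".b)
  f.flush

  # Read as ISO-8859-1 (external) and transcode to UTF-8 (internal).
  text = File.read(f.path, encoding: 'ISO-8859-1:UTF-8')
  text.chomp.split("\u00AC")   # UTF-8 separator against a UTF-8 string
end

p fields
```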

How can I use encode utf-8 in Ruby?

I am trying to extract a word from a first line of file:
LOCATION,Feij�,AC,a,b,c
this way:
2.0.0-p247 :005 > File.foreach(file).first
=> "LOCATION,Feij\xF3,AC,a,b,c\r\n"
but when I try to use split:
2.0.0-p247 :008 > File.foreach(file).first.split(",")
ArgumentError: invalid byte sequence in UTF-8
from (irb):8:in `split'
from (irb):8
from /home/bleh/.rvm/rubies/ruby-2.0.0-p247/bin/irb:13:in `<main>'
What I expected is: Feijó
I have already tried a lot of combinations, like .encode and .force_encoding.
Any ideas?
The character ó is \xF3 in the ISO-8859-1 encoding, so this is probably the encoding of the file (it could also be CP-1252).
You can specify the encoding as an arg to File::foreach, and you can also ask Ruby to re-encode it to UTF-8 for you:
File.foreach(file, :encoding => 'iso-8859-1:utf-8').first.split(",")
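A quick sketch of why either guess works for this file: byte 0xF3 decodes to the same character under both ISO-8859-1 and Windows-1252 (Ruby's name for CP-1252). The two only disagree in the 0x80-0x9F range; byte 0x80 is used below purely as an illustration, and decode is a made-up helper.

```ruby
# Hypothetical helper: interpret raw bytes in the given encoding,
# then transcode the result to UTF-8.
def decode(bytes, encoding)
  bytes.b.force_encoding(encoding).encode('UTF-8')
end

o_latin1 = decode("\xF3", 'ISO-8859-1')
o_cp1252 = decode("\xF3", 'Windows-1252')
p o_latin1 == o_cp1252          # true: both give U+00F3

euro_cp1252 = decode("\x80", 'Windows-1252')
ctrl_latin1 = decode("\x80", 'ISO-8859-1')
p euro_cp1252.ord.to_s(16)      # "20ac", the euro sign
p ctrl_latin1.ord.to_s(16)      # "80", a control character
```

If the file might contain characters such as the euro sign or curly quotes, the choice between the two encodings starts to matter.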

Ruby to_json issue with error "illegal/malformed utf-8"

I got an error JSON::GeneratorError: source sequence is illegal/malformed utf-8 when trying to convert a hash into json string. I am wondering if this has anything to do with encoding, and how can I make to_json just treat \xAE as it is?
$ irb
2.0.0-p247 :001 > require 'json'
=> true
2.0.0-p247 :002 > a = {"description"=> "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
2.0.0-p247 :003 > a.to_json
JSON::GeneratorError: source sequence is illegal/malformed utf-8
from (irb):3:in `to_json'
from (irb):3
from /Users/cchen21/.rvm/rubies/ruby-2.0.0-p247/bin/irb:16:in `<main>'
\xAE on its own is not a valid byte sequence in UTF-8; you have to use \u00AE instead:
"iPhone\u00AE"
#=> "iPhone®"
Or convert it accordingly:
"iPhone\xAE".force_encoding("ISO-8859-1").encode("UTF-8")
#=> "iPhone®"
Every string in Ruby has an underlying encoding. Depending on your LANG and LC_ALL environment variables, the interactive shell might be executing and interpreting your strings in a given encoding.
$ irb
1.9.3p392 :008 > __ENCODING__
=> #<Encoding:UTF-8>
(ignore that I’m using Ruby 1.9 instead of 2.0, the ideas are still the same).
__ENCODING__ returns the current source encoding. Yours will probably also say UTF-8.
When you create literal strings and use byte escapes (the \xAE) in your code, Ruby is trying to interpret that according to the string encoding:
1.9.3p392 :003 > a = {"description" => "iPhone\xAE"}
=> {"description"=>"iPhone\xAE"}
1.9.3p392 :004 > a["description"].encoding
=> #<Encoding:UTF-8>
So Ruby will try to treat the byte \xAE at the end of your literal string as part of a UTF-8 byte stream, where it is invalid. See what happens when I try to print it:
1.9.3-p392 :001 > puts "iPhone\xAE"
iPhone�
=> nil
You need to either provide the registered trademark character in valid UTF-8 (using the real character, or its two UTF-8 bytes):
1.9.3-p392 :002 > a = {"description1" => "iPhone®", "description2" => "iPhone\xc2\xae"}
=> {"description1"=>"iPhone®", "description2"=>"iPhone®"}
1.9.3-p392 :005 > a.to_json
=> "{\"description1\":\"iPhone®\",\"description2\":\"iPhone®\"}"
Or, if your input is ISO-8859-1 (Latin 1) and you know it for sure, you can tell Ruby to interpret your string as another encoding:
1.9.3-p392 :006 > a = {"description1" => "iPhone\xAE".force_encoding('ISO-8859-1') }
=> {"description1"=>"iPhone\xAE"}
1.9.3-p392 :007 > a.to_json
=> "{\"description1\":\"iPhone®\"}"
Hope it helps.
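Putting the two answers together, here is a hedged sketch that repairs strings before calling to_json. The helper name utf8_or_latin1 is made up, and falling back to ISO-8859-1 is an assumption about where the data came from, not something Ruby can detect for you.

```ruby
require 'json'

# Hypothetical helper: keep valid UTF-8 as it is, otherwise treat the bytes
# as ISO-8859-1 (an assumption about the data source) and transcode.
def utf8_or_latin1(string)
  utf8 = string.b.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?
  string.b.force_encoding('ISO-8859-1').encode('UTF-8')
end

raw  = { "description" => "iPhone\xAE" }
safe = raw.map { |key, value| [key, utf8_or_latin1(value)] }.to_h

json = safe.to_json
puts json
```

Strings that are already valid UTF-8 pass through unchanged, so the helper can be applied to every value without checking each one first.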

how to select dropdown having Encoding::UndefinedConversionError in watir?

I want to select dropdown having text="Côte d'Ivoire".
ie.select_list(:id, "name01").select("#{text}")
I tried these codes,
1.encoding: UTF-8 #not working
2.text.force_encoding("ASCII-8BIT").encode('UTF-8', undef: :replace, replace:'')
#text=Cte d'Ivoire
What should I do?
I also want to save this text to my DB. Please help.
If you know the string is UTF-8 encoded, why not just force encoding to UTF-8?
#encoding: ASCII-8BIT
str = "C\xC3\xB4te d'Ivoire" # => "C\xC3\xB4te d'Ivoire"
str.encoding # => #<Encoding:ASCII-8BIT>
str.force_encoding('UTF-8')
str # => "Côte d'Ivoire"
str.encoding # => #<Encoding:UTF-8>
If you are using Côte d'Ivoire as a literal anywhere in your Ruby source files, be sure to add
#encoding: UTF-8
as the first line of the file to tell Ruby that the file is UTF-8 encoded.
I would have expected your solutions to work, unless the software you are using to save/execute the files is overriding the setting. I recall having that issue with NetBeans.
An alternative, if you cannot fix the actual encoding, is to use a regex to match just the standard characters.
text = /C.te d'Ivoire/
browser.select_list.select(text)
The regex replaces each accented character with a . (which matches any single character).
Not a great solution, but perhaps a solution if nothing else works.

convert utf-8 to unicode in ruby

The UTF-8 of "龅" is E9BE85 and the unicode is U+9F85. Following code did not work as expected:
irb(main):004:0> "龅"
=> "\351\276\205"
irb(main):005:0> Iconv.iconv("unicode","utf-8","龅").to_s
=> "\377\376\205\237"
P.S: I am using Ruby1.8.7.
Ruby 1.9+ is much better equipped to deal with Unicode than 1.8.7, so, I strongly suggest running under 1.9.2 if at all possible.
Part of the problem is that 1.8 didn't understand that a UTF-8 or Unicode character could be more than one byte long. 1.9 does understand that and introduces things like String#each_char.
# encoding: UTF-8
require 'iconv'
RUBY_VERSION # => "1.9.2"
"龅".encoding # => #<Encoding:UTF-8>
"龅".each_char.entries # => ["龅"]
Iconv.iconv("unicode","utf-8","龅").to_s # =>
# ~> -:8:in `iconv': invalid encoding ("unicode", "utf-8") (Iconv::InvalidEncoding)
# ~> from -:8:in `<main>'
To get the list of available encodings with Iconv, do:
require 'iconv'
puts Iconv.list
It's a long list so I won't add it here.
You can try this:
"%04x" % "龅".unpack("U*")[0]
=> "9f85"
You should use UNICODEBIG// as the target encoding:
irb(main):014:0> Iconv.iconv("UNICODEBIG//","utf-8","龅")[0].each_byte {|b| puts b.to_s(16)}
9f
85
=> "\237\205"
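On Ruby 1.9 and later none of this needs Iconv. A short sketch, assuming a UTF-8 string: String#ord gives the code point, and encode('UTF-16BE') gives the big-endian bytes. The \377\376 at the start of the question's output is a little-endian byte-order mark, which is why the two bytes after it appear swapped.

```ruby
char = "\u9F85"   # the character from the question

code_point = char.ord
puts format('U+%04X', code_point)                 # U+9F85

utf8_hex    = char.bytes.map { |b| '%02X' % b }.join
utf16be_hex = char.encode('UTF-16BE').bytes.map { |b| '%02X' % b }.join
utf16le_hex = char.encode('UTF-16LE').bytes.map { |b| '%02X' % b }.join

puts utf8_hex      # E9BE85
puts utf16be_hex   # 9F85
puts utf16le_hex   # 859F
```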
