how to select dropdown having Encoding::UndefinedConversionError in watir? - ruby

I want to select a dropdown option whose text is "Côte d'Ivoire".
ie.select_list(:id, "name01").select("#{text}")
I tried the following:
1. Adding "# encoding: UTF-8" to the file (not working)
2. text.force_encoding("ASCII-8BIT").encode('UTF-8', undef: :replace, replace:'')
#text=Cte d'Ivoire
What should I do?
I also want to save this text to my DB. Please help.

If you know the string is UTF-8 encoded, why not just force encoding to UTF-8?
#encoding: ASCII-8BIT
str = "C\xC3\xB4te d'Ivoire" # => "C\xC3\xB4te d'Ivoire"
str.encoding # => #<Encoding:ASCII-8BIT>
str.force_encoding('UTF-8')
str # => "Côte d'Ivoire"
str.encoding # => #<Encoding:UTF-8>
If you are using Côte d'Ivoire as a literal anywhere in your Ruby source files, be sure to add
#encoding: UTF-8
as the first line of the file to tell Ruby that the file is UTF-8 encoded.
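To see why the second attempt in the question produced "Cte d'Ivoire", here is a minimal sketch. It assumes the text arrives as raw UTF-8 bytes tagged ASCII-8BIT; the variable names are illustrative only.

```ruby
# Raw UTF-8 bytes tagged as ASCII-8BIT (binary); String#b needs Ruby 2.0+.
raw = "C\xC3\xB4te d'Ivoire".b

# Transcoding from ASCII-8BIT treats every byte above 0x7F as undefined,
# so both bytes of the "ô" are replaced with '' and the letter disappears.
stripped = raw.encode('UTF-8', undef: :replace, replace: '')

# Relabeling keeps the bytes and only changes how they are interpreted.
fixed = raw.dup.force_encoding('UTF-8')

puts stripped               # Cte d'Ivoire
puts fixed                  # Côte d'Ivoire
puts fixed.valid_encoding?  # true
```

The bytes were never wrong; only the label was, which is why force_encoding is enough here.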

I would have expected your solutions to work, unless the software you are using to save/execute the files is overriding the setting. I recall having that issue with NetBeans.
An alternative, if you cannot fix the actual encoding, is to use a regex to match just the standard characters.
text = /C.te d'Ivoire/
browser.select_list.select(text)
The regex replaces each accented character with a "." (which matches any single character).
Not a great solution, but perhaps a solution if nothing else works.

Related

How to use :replace, :invalid and :undef args for encoding using CSV.read?

Since ruby 1.9, CSV uses a parser that can perform encoding, if you use methods like:
::foreach, ::open, ::read, and ::readlines.
For example: CSV.read('path/to/file', encoding: "windows-1252:UTF-8") tries to read a file in windows-1252 and returns an array with utf-8 encoded strings.
If the encode conversion between charsets has undefined characters it gives an Encoding::UndefinedConversionError.
String#encode has some nice options for dealing with these undefined characters:
str = str.encode('UTF-8', invalid: :replace, undef: :replace, replace: "" )
Is there a way to use this kind of replace rules for undefined conversions between charsets with CSV parser?
Thank you.
There is, indeed, a way. The trick is to define a custom converter that does the conversion you want using String#encode. Converters are run before CSV tries to do its automatic conversion to UTF-8. We pass the custom converter to CSV.read as the :converters option, along with the original :encoding:
UTF8_CONVERTER = ->(field) { field.encode('utf-8', invalid: :replace, undef: :replace, replace: "") }
CSV.read('foo.csv', encoding: 'windows-1252', converters: UTF8_CONVERTER)
Since there aren't any characters in Windows-1252 that aren't also in UTF-8, I'll demonstrate the other way around. Suppose you have this UTF-8 CSV file:
foo,bar
yes👍,no💩
And suppose I want to convert it to ASCII-8BIT (because reasons?). This gives me an error:
CSV.read('emoji.csv', encoding: 'utf-8:ascii-8bit')
# => Encoding::UndefinedConversionError: U+1F44D from UTF-8 to ASCII-8BIT
But if I define a custom converter that replaces those undefined characters, it works perfectly:
ASCII_CONVERTER = ->(field) { field.encode('ascii-8bit', undef: :replace, replace: "#") }
CSV.read('emoji.csv', encoding: 'utf-8', converters: ASCII_CONVERTER)
# => [ [ "foo", "bar" ],
# [ "yes#", "no#"] ]
(Note that encoding: 'utf-8' isn't strictly necessary here, since UTF-8 is the default, but it will be necessary if your file has a different encoding.)
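The converter approach can also be tried without a file on disk. The following is a rough sketch under a few assumptions: the data is an in-memory stand-in for emoji.csv, the lambda name to_ascii is made up for illustration, and US-ASCII is swapped in as the target encoding instead of ASCII-8BIT.

```ruby
require 'csv'

# In-memory stand-in for the emoji.csv file described above.
data = "foo,bar\nyes\u{1F44D},no\u{1F4A9}\n"

# Hypothetical converter: characters with no US-ASCII equivalent become "#".
# Both undef: :replace and replace: are given; replace: alone only sets the
# replacement string and does not switch replacement on.
to_ascii = lambda do |field|
  field.encode('US-ASCII', invalid: :replace, undef: :replace, replace: '#')
end

rows = CSV.parse(data, converters: [to_ascii])
p rows  # [["foo", "bar"], ["yes#", "no#"]]
```

CSV.read accepts the same :converters option, so the lambda should carry over unchanged when reading from a path.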
If you want to use the replace behavior of String#encode, you will either have to encode the whole file content with it or do it line by line. You will lose information with this.
This is one way of doing it though:
file = File.open('path/to/file.csv')
file.each do |line|
# keep in mind that the first parameter here is the destination encoding,
# the second is the source encoding
sanitized_line = line.encode('UTF-8', 'windows-1252', invalid: :replace, undef: :replace, replace: '')
fields_array = CSV.parse_line(sanitized_line)
# do whatever you want with the fields you extracted
end
If your conversion from one encoding to another is pretty much guaranteed not to lose information (iso-8859-1 to utf-8, for example), I would really recommend simply converting the file on reading.
Another thing to keep in mind is that Ruby does not try to figure out the encoding of a file you are reading on its own. If you omit the parameter, it only uses the default encoding for its external and internal encoding. So you have to specify the encoding the file is in yourself. Ruby has no really reliable way of doing this, so in my case I ended up doing this (on an Ubuntu system):
encoding = `file --mime-encoding #{path_to_file} | awk '{print $2}'`.strip
arr_of_arrs = CSV.read(path_to_file, encoding: "#{encoding}:utf-8")

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

I am trying to convert a string from ISO-8859-1 encoding to UTF-8 but I can't seem to get it work. Here is an example of what I have done in irb.
irb(main):050:0> string = 'Norrlandsvägen'
=> "Norrlandsvägen"
irb(main):051:0> string.force_encoding('iso-8859-1')
=> "Norrlandsv\xC3\xA4gen"
irb(main):052:0> string = string.encode('utf-8')
=> "NorrlandsvÃ¤gen"
I am not sure why Norrlandsvägen in iso-8859-1 is converted into NorrlandsvÃ¤gen in utf-8.
I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and all kinds of weird work-arounds I could think of but nothing seems to work. Can someone please help me/point me in the right direction?
Ruby newbie still pulling hair like crazy but feeling grateful for all the replies here... :)
Background of this question: I am writing a gem that will download an xml file from some websites (which will have iso-8859-1 encoding) and save it in a storage and I would like to convert it to utf-8 first. But words like Norrlandsvägen keep messing me up. Really any help would be greatly appreciated!
[UPDATE]: I realized running tests like this in the irb console might give me different behaviors so here is what I have in my actual code:
def convert_encoding(string, originalEncoding)
puts "#{string.encoding}" # ASCII-8BIT
string.encode(originalEncoding)
puts "#{string.encoding}" # still ASCII-8BIT
string.encode!('utf-8')
end
but the last line gives me the following error:
Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8
Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:
irb(main):001:0> string = 'ä'
=> "ä"
irb(main):002:0> string.force_encoding('iso-8859-1')
=> "\xC3\xA4"
I have also tried to assign a new variable to the result of string.encode(originalEncoding) but got an even weirder error:
newString = string.encode(originalEncoding)
puts "#{newString.encoding}" # can't even get to this line...
newString.encode!('utf-8')
and the error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1
I am still quite lost in all of this encoding mess but I am really grateful for all the replies and help everyone has given me! Thanks a ton! :)
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "Ã¤"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
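The three steps described above can be replayed in one short script. This is only a sketch with illustrative variable names; the final "repaired" step is not part of the answer and is included just to show that the damage is reversible when you know what happened.

```ruby
original = "\u00E4"  # "ä", stored as UTF-8

# force_encoding relabels: same two bytes, now read as two Latin-1 characters.
relabeled = original.dup.force_encoding('ISO-8859-1')

# encode transcodes: each of those two characters is rewritten in UTF-8.
doubled = relabeled.encode('UTF-8')

# Undo the damage: transcode back to Latin-1, then relabel the bytes as UTF-8.
repaired = doubled.encode('ISO-8859-1').force_encoding('UTF-8')

p original.bytes        # [195, 164]
p relabeled.bytes       # [195, 164]
p doubled.bytes         # [195, 131, 194, 164]
p repaired == original  # true
```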
EDIT For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
:use_ssl => uri.scheme == 'https',
:verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding. Consequently, the following code causes your problem:
string = "Norrlandsvägen"
string.force_encoding('iso-8859-1')
puts string.encode('utf-8') # NorrlandsvÃ¤gen
Whereas the following code will actually correctly encode your contents:
string = "Norrlandsvägen".encode('iso-8859-1')
string.encode!('utf-8')
Here's an example running in irb:
irb(main):023:0> string = "Norrlandsvägen".encode('iso-8859-1')
=> "Norrlandsv\xE4gen"
irb(main):024:0> string.encoding
=> #<Encoding:ISO-8859-1>
irb(main):025:0> string.encode!('utf-8')
=> "Norrlandsvägen"
irb(main):026:0> string.encoding
=> #<Encoding:UTF-8>
The above answer was spot on. Specifically this point here:
There's a difference between force_encoding and encode. The former
sets the encoding for the string, whereas the latter actually
transcodes the contents of the string to the new encoding.
In my situation, I had a text file with iso-8859-1 encoding. By default, Ruby uses UTF-8 encoding, so if you were to try to read the file without specifying the encoding, then you would get an error:
results = File.read(file)
results.encoding
=> #<Encoding:UTF-8>
results.split("\r\n")
ArgumentError: invalid byte sequence in UTF-8
You get an invalid byte sequence error because ISO-8859-1 stores each accented character as a single byte above 0x7F, and such a byte on its own is not a valid UTF-8 sequence. Consequently, you need to specify the encoding to the File API. Think of it like force_encoding:
results = File.read(file, encoding: "iso-8859-1")
So everything is good, right? No, not if you want to start parsing the ISO-8859-1 string using UTF-8 string literals:
results = File.read(file, encoding: "iso-8859-1")
results.each_line do |line|
puts line.split('¬')
end
Encoding::CompatibilityError: incompatible character encodings: ISO-8859-1 and UTF-8
Why this error? Because '¬' is a UTF-8 string. You are splitting an ISO-8859-1 string with a UTF-8 separator that contains a non-ASCII character, so the two encodings are incompatible. Consequently, after you read the file as ISO-8859-1, you can ask Ruby to transcode it into UTF-8. From then on you are working with UTF-8 strings and the problem goes away:
results = File.read(file, encoding: "iso-8859-1").encode('UTF-8')
results.encoding
results = results.split("\r\n")
results.each do |line|
puts line.split('¬')
end
Ultimately, with some Ruby APIs, you do not need to use force_encoding('ISO-8859-1'). Instead, you just specify the expected encoding to the API. However, you must convert it back to UTF-8 if you plan to parse it with UTF-8 strings.
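As a self-contained illustration of reading and transcoding in one step, here is a sketch that writes a small ISO-8859-1 file to a temporary location and reads it back. The file contents and variable names are invented for the example, and it uses the "external:internal" form of the :encoding option rather than a separate encode call.

```ruby
require 'tempfile'

separator = "\u00AC"  # the "¬" character as a UTF-8 string

rows = Tempfile.create('latin1') do |f|
  f.binmode
  # 0xAC is "¬" in ISO-8859-1; write the raw bytes so the file is not UTF-8.
  f.write("a\xACb\nc\xACd\n".b)
  f.flush

  # "external:internal" reads ISO-8859-1 and hands back UTF-8 strings.
  text = File.read(f.path, encoding: 'ISO-8859-1:UTF-8')
  text.lines(chomp: true).map { |line| line.split(separator) }
end

p rows  # [["a", "b"], ["c", "d"]]
```

Because the string is already UTF-8 by the time split runs, the UTF-8 separator no longer triggers Encoding::CompatibilityError.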

Ruby CSV UTF8 encoding error while reading

This is what I was doing:
csv = CSV.open(file_name, "r")
I used this for testing:
line = csv.shift
while not line.nil?
puts line
line = csv.shift
end
And I ran into this:
ArgumentError: invalid byte sequence in UTF-8
I read the answer here and this is what I tried
csv = CSV.open(file_name, "r", encoding: "windows-1251:utf-8")
I ran into the following error:
Encoding::UndefinedConversionError: "\x98" to UTF-8 in conversion from Windows-1251 to UTF-8
Then I came across a Ruby gem - charlock_holmes. I figured I'd try using it to find the source encoding.
CharlockHolmes::EncodingDetector.detect(File.read(file_name))
=> {:type=>:text, :encoding=>"windows-1252", :confidence=>37, :language=>"fr"}
So I did this:
csv = CSV.open(file_name, "r", encoding: "windows-1252:utf-8")
And still got this:
Encoding::UndefinedConversionError: "\x8F" to UTF-8 in conversion from Windows-1252 to UTF-8
It looks like you have a problem detecting the correct encoding of your file. CharlockHolmes gives you a useful hint with :confidence=>37, which means the detected encoding may not be the right one.
Based on the error messages and test_transcode.rb from https://github.com/MacRuby/MacRuby/blob/master/test-mri/test/ruby/test_transcode.rb I found an encoding that gets past both of your error messages. With the help of String#encode it's easy to test:
"\x8F\x98".encode("UTF-8","cp1256") # => "ڈک"
Your issue looks strictly related to the file and not to Ruby.
If we are not sure which encoding to use and can accept losing some characters, we can use the :invalid and :undef options of String#encode, in this case:
"\x8F\x98".encode("UTF-8", "CP1250",:invalid => :replace, :undef => :replace, :replace => "?") # => "Ź?"
Another way is to use Iconv's //IGNORE option on the target encoding:
Iconv.iconv("UTF-8//IGNORE","CP1250", "\x8F\x98")
As a source encoding, the suggestion from CharlockHolmes should be pretty good.
PS. String#encode was introduced in Ruby 1.9. With Ruby 1.8 you can use Iconv.
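The trial-and-error search for an encoding that accepts both problem bytes can be scripted. The sketch below is one possible way to do it; the candidate list is an assumption chosen to mirror the encodings mentioned in this thread, not an exhaustive set.

```ruby
bytes = "\x8F\x98".b  # the two bytes from the error messages
candidates = %w[Windows-1251 Windows-1252 Windows-1256]

results = candidates.map do |name|
  decoded = begin
    bytes.encode('UTF-8', name)
  rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
    nil  # this encoding has no mapping for at least one of the bytes
  end
  [name, decoded]
end.to_h

results.each { |name, text| puts "#{name}: #{text.inspect}" }
```

Going by the error messages quoted in the question, the first two candidates should come back as nil while Windows-1256 decodes both bytes. A successful decode only shows that the bytes are mappable, not that the result is the text the author intended, so it is worth eyeballing the output.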

Delete non-UTF characters from a string in Ruby?

How do I delete non-UTF-8 characters from a Ruby string? I have a string that has, for example, "\xC2" in it. I want to remove that char from the string so that it becomes valid UTF-8.
This:
text.gsub!(/\xC2/, '')
returns an error:
incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
I was looking at text.unpack('U*') and string.pack as well, but did not get anywhere.
You can use encode for that.
text.encode('UTF-8', :invalid => :replace, :undef => :replace)
For more info look into Ruby-Docs
You could do it like this
# encoding: utf-8
class String
def validate_encoding
chars.select(&:valid_encoding?).join
end
end
puts "testing\xC2 a non UTF-8 string".validate_encoding
#=>testing a non UTF-8 string
Your text has ASCII-8BIT encoding. Instead, you could use this, which deletes every character outside the ASCII range:
text.delete!("^\u{0000}-\u{007F}")
It will serve the same purpose.
You can use /n, as in
text.gsub!(/\xC2/n, '')
to force the Regexp to operate on bytes.
Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
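The warning about \xC2 is easy to demonstrate. In this sketch the pound sign (U+00A3) is an arbitrary example of a character whose UTF-8 form starts with that byte; the variable names are illustrative.

```ruby
price = "\u{A3}5"          # "£5"; the pound sign is the bytes C2 A3 in UTF-8
p price.bytes              # [194, 163, 53]

# Byte-level removal of every 0xC2 also hits the lead byte of the pound sign.
mangled = price.b.gsub(/\xC2/n, '').force_encoding('UTF-8')
p mangled.valid_encoding?  # false

# Dropping only the characters that fail validation leaves valid text alone.
dirty = "testing\xC2 a non UTF-8 string"
clean = dirty.chars.select(&:valid_encoding?).join
p clean                                                 # "testing a non UTF-8 string"
p price.chars.select(&:valid_encoding?).join == price   # true
```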
Try Iconv
1.9.3p194 :001 > require 'iconv'
# => true
1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
# => "testing\xC2 a non UTF-8 string"
1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# => #<Iconv:0x000000026c9290>
1.9.3p194 :004 > ic.iconv string
# => "testing a non UTF-8 string"
The best solution to this problem that I've found is this answer to the same question: https://stackoverflow.com/a/8711118/363293.
In short: "€foo\xA0".chars.select(&:valid_encoding?).join
data = '' if not (data.force_encoding("UTF-8").valid_encoding?)

ruby 1.9: invalid byte sequence in UTF-8

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understood, the net/http library doesn't have any encoding specific options and the stuff that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...
In Ruby 1.9.3 it is possible to use String#encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (iconv) and 1.9 (String#encode):
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
file_contents = ic.iconv(file_contents)
end
or if you have really troublesome input you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
file_contents.encode!('UTF-8', 'UTF-16')
else
ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
file_contents = ic.iconv(file_contents)
end
Neither the accepted answer nor the other answer worked for me. I found this post which suggested
string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
This fixed the problem for me.
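One thing worth knowing about the 'binary' source encoding is that it removes every non-ASCII byte, including bytes that were part of perfectly valid UTF-8 characters. A minimal sketch, with made-up sample strings:

```ruby
opts = { invalid: :replace, undef: :replace, replace: '' }

valid = "caf\xC3\xA9"  # "café", already valid UTF-8
p valid.encode('UTF-8', 'binary', **opts)   # "caf"

broken = "caf\xC3\xA9 \xFFok"  # valid "é" followed by a stray 0xFF byte
p broken.encode('UTF-8', 'binary', **opts)  # "caf ok"

# Guarding with valid_encoding? avoids touching strings that are already fine.
kept = valid.valid_encoding? ? valid : valid.encode('UTF-8', 'binary', **opts)
p kept == valid  # true
```

That is acceptable when only ASCII content matters, as with href extraction, but it is lossy for anything else.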
My current solution is to run:
my_string.unpack("C*").pack("U*")
This will at least get rid of the exceptions which was my main problem
Try this:
def to_utf8(str)
str = str.force_encoding('UTF-8')
return str if str.valid_encoding?
str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end
I recommend using an HTML parser. Just find the fastest one.
Parsing HTML is not as easy as it may seem.
Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by simply substituting the "�" symbol, so once the invalid sequence has been parsed, the resulting text is a valid string.
Even inside attribute values you have to decode HTML entities like &amp;.
Here is a great question that sums up why you can not reliably parse HTML with a regular expression:
RegEx match open tags except XHTML self-contained tags
attachment = file.read
begin
# Try it as UTF-8 directly
cleaned = attachment.dup.force_encoding('UTF-8')
unless cleaned.valid_encoding?
# Some of it might be old Windows code page
cleaned = attachment.encode( 'UTF-8', 'Windows-1252' )
end
attachment = cleaned
rescue EncodingError
# Force it to UTF-8, throwing out invalid bits
attachment = attachment.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
end
This seems to work:
def sanitize_utf8(string)
return nil if string.nil?
return string if string.valid_encoding?
string.chars.select { |c| c.valid_encoding? }.join
end
I encountered a string that mixed English, Russian and some other alphabets, which caused an exception. I need only Russian and English, and this currently works for me:
ec1 = Encoding::Converter.new "UTF-8","Windows-1251",:invalid=>:replace,:undef=>:replace,:replace=>""
ec2 = Encoding::Converter.new "Windows-1251","UTF-8",:invalid=>:replace,:undef=>:replace,:replace=>""
t = ec2.convert ec1.convert t
While Nakilon's solution works, at least as far as getting past the error, in my case, I had this weird f-ed up character originating from Microsoft Excel converted to CSV that was registering in ruby as a (get this) cyrillic K which in ruby was a bolded K. To fix this I used 'iso-8859-1' viz. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky deaky cyrillic K's into a much more manageable /\xCA/, which I could then remove with string.gsub!(/\xCA/, '')
Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images which are not encoded in UTF-8. The page could also be non-html if you picked up a href in something like a <link> element. How to check this varies on what HTTP library you are using. Then, make sure the result is only ascii with String#ascii_only? (not UTF-8 because HTML is only supposed to be using ascii, entities can be used otherwise). If both of those tests pass, it is safe to use scan.
There is also the scrub method to filter invalid bytes.
string.scrub('')
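For reference, here is a short sketch of the three ways String#scrub (Ruby 2.1+) can be called. The sample string is invented; unpack1 in the block form needs Ruby 2.4 or later.

```ruby
dirty = "abc\xFFdef"

p dirty.valid_encoding?  # false

# No argument: each invalid byte sequence becomes U+FFFD, the replacement character.
with_marker = dirty.scrub

# String argument: invalid sequences are replaced with that string (here, removed).
removed = dirty.scrub('')

# Block form: the block receives the invalid bytes and returns the replacement.
tagged = dirty.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }

p removed  # "abcdef"
p tagged   # "abc<ff>def"
```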
If you don't "care" about the data you can just do something like:
search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"
I just used valid_encoding? to get past it. Mine is a search field, and I kept running into the same weirdness over and over, so I used something like the line above just to keep the system from breaking. Since I don't control the user experience to autovalidate prior to sending this info (like auto feedback to say "dummy up!"), I can just take it in, strip it out and return blank results.
