UTF-8 Error in Ruby - ruby

I'm scraping a few websites and eventually I hit a UTF-8 error that looks like this:
/usr/local/lib/ruby/gems/1.9.1/gems/dm-core-1.2.0/lib/dm-core/support/ext/blank.rb:19:in
`=~': invalid byte sequence in UTF-8 (ArgumentError)
Now, I don't care about the websites being 100% accurate. Is there a way I can take the page I get and strip out any problem encodings and then pass it around inside my program?
I'm using ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0] if that matters.
Update:
def self.blank?(value)
  return value.blank? if value.respond_to?(:blank?)
  case value
  when ::NilClass, ::FalseClass
    true
  when ::TrueClass, ::Numeric
    false
  when ::Array, ::Hash
    value.empty?
  when ::String
    value !~ /\S/ # This is line 19, which raises the error.
  else
    value.nil? || (value.respond_to?(:empty?) && value.empty?)
  end
end
When I try to save the following line:
What Happens in The Garage Tin Sign2. � � Newsletter Our monthly newsletter,
It throws the error. It's on page: http://www.stationbay.com/. But what is odd is that when I view it in my web browser it doesn't show the funny symbols in the source.
What do I do next?

The problem is that your string contains non-UTF-8 characters, but seems to have UTF-8 encoding forced. The following short code demonstrates the issue:
a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding? # returns false
a =~ /x/ # provokes ArgumentError: invalid byte sequence in UTF-8
The best way to fix this is to apply the proper encoding right from the beginning. If this is not an option, you can use String#encode:
a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding? # returns false
a.encode!("utf-8", "utf-8", :invalid => :replace)
a.valid_encoding? # returns true now
a =~ /x/ # works now


ruby: How come File.read in utf-8 fails to read a valid utf-8 character?

Context
Trying to parse with ruby 2.6.3+ a csv file utf-8 encoded
The file contains one character that throws a CSV::MalformedCSVError
After isolating it, the character of interest seems to be the SINGLE LOW-9 QUOTATION MARK (U+201A). This one: ‚.
That character seems to be supported by UTF-8 (3 bytes, in hex: e2809a or \xe2\x80\x9a) (ref1, ref2)
test.csv file content
"value"
"abc‚d"
Error when parsing with CSV class:
irb(main):001:0> require 'csv'
=> true
irb(main):002:0> CSV.read("test.csv", **{headers: true, skip_blanks: true, encoding: "utf-8"})
Traceback (most recent call last):
2: from C:/ruby/Ruby26-x64/lib/ruby/2.6.0/csv/parser.rb:297:in `parse'
1: from C:/ruby/Ruby26-x64/lib/ruby/2.6.0/csv/parser.rb:711:in `build_scanner'
CSV::MalformedCSVError (Invalid byte sequence in UTF-8 in line 2.)
although \u201a is a utf-8 supported character:
irb(main):001:0> str = "\u201a"
=> "‚"
irb(main):002:0> str.ord.to_s(16)
=> "201a"
irb(main):003:0> str.encoding.name
=> "UTF-8"
irb(main):004:0> str.unpack("H*")
=> ["e2809a"]
Treating encoding errors with scrub
Using String#scrub with a block does not seem to get the SINGLE LOW-9 QUOTATION MARK. See the code below (x82 refers to the original ‚):
irb(main):001:0> content = File.read('test.csv', encoding: 'utf-8')
=> "\"value\"\n\"abc\x82d\"\n"
irb(main):002:0> content.scrub! do |bytes|
irb(main):003:1* "<" + bytes.unpack('H*')[0] + ">"
irb(main):004:1> end
=> "\"value\"\n\"abc<82>d\"\n"
irb(main):005:0> require 'csv'
=> true
irb(main):006:0> CSV.parse(content, **{headers: true, skip_blanks: true, encoding: "utf-8"}).each do |row|
irb(main):007:1* pp row["value"]
irb(main):008:1> end
"abc<82>d"
=> #<CSV::Table mode:col_or_row row_count:2>
As shown above, the SINGLE LOW-9 QUOTATION MARK is replaced by <82>, the hexadecimal representation of the offending bytes. At this point, I am a bit lost.
Question
While it seems consistent that there is a CSV::MalformedCSVError and that String#scrub fails to find the 3 bytes of the UTF-8 e2809a character (getting one byte, 0x82, instead):
If the character \u201a is supported by UTF-8 (as e2809a), what can be the root cause of the error?
Opening the csv file in a text editor (Notepad++), the character is correctly displayed on the text area. When I copy and paste the character (‚) in this Unicode Lookup, it correctly identifies the character. So nothing seems to be wrong with it.
Looking at the code sample above (irb), nothing seems wrong with the usage of CSV.read and File.read. Could you confirm (and explain) or rule out that the error is related to the usage of those methods?
Perhaps there is nothing to fix and this is to be expected, but I am not sure. It seems fairly difficult, if not impossible, to create a generic solution in Ruby that correctly spots the offending character.
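One detail that may explain the mismatch: in the Windows-1252 code page, the single byte 0x82 is exactly U+201A, so a plausible root cause is that the file was saved as Windows-1252 ("ANSI") rather than UTF-8. A minimal sketch of that hypothesis:

```ruby
# In Windows-1252, the lone byte 0x82 encodes U+201A (SINGLE LOW-9
# QUOTATION MARK); UTF-8 needs three bytes (E2 80 9A) for the same
# character. Reinterpreting the raw bytes as Windows-1252 and
# transcoding recovers the intended text.
raw  = "abc\x82d".force_encoding("Windows-1252")
utf8 = raw.encode("UTF-8")
puts utf8                  # "abc‚d"
puts utf8.unpack("H*")[0]  # "616263e2809a64"
```

If that is the case, reading the file with encoding: "Windows-1252:UTF-8" (or re-saving it as UTF-8 in the editor) should make the CSV parse cleanly.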

RSpec - inspected result must be ASCII only [duplicate]

The following code works without problem:
#encoding: utf-8
class Text
  def initialize(txt)
    @txt = txt
  end

  def inspect
    "<Text: %s>" % @txt
  end
end

p Text.new('Hello World')
But if I try p Text.new('Hä, was soll das?') I get an Encoding::CompatibilityError:
inspect_with_umlaut.rb:26:in `p': inspected result must be ASCII only or use the default external encoding (Encoding::CompatibilityError)
from inspect_with_umlaut.rb:26:in `<main>'
Why this?
And more important: How can I avoid it?
The error message already explains the why:
inspected result must be ASCII only or use the default external encoding
In this case the result of inspect contains a UTF-8 character (not ASCII), but the default external encoding appears to be something else.
The default encoding can be read from Encoding.default_external.
To avoid the error you must encode the result of inspect:
#encoding: utf-8
class Text
  def initialize(txt)
    @txt = txt
  end

  def inspect
    # force ASCII and replace invalid/undefined characters
    ("<Text: %s>" % @txt).encode('ASCII', :undef => :replace, :invalid => :replace)
  end
end
p Text.new('Hä, was soll das?') #-> <Text: H?, was soll das?>
Instead of ASCII you can also pass Encoding.default_external to encode:
("<Text: %s>" % @txt).encode(Encoding.default_external, :undef => :replace)

Ruby converting string encoding from ISO-8859-1 to UTF-8 not working

I am trying to convert a string from ISO-8859-1 encoding to UTF-8 but I can't seem to get it work. Here is an example of what I have done in irb.
irb(main):050:0> string = 'Norrlandsvägen'
=> "Norrlandsvägen"
irb(main):051:0> string.force_encoding('iso-8859-1')
=> "Norrlandsv\xC3\xA4gen"
irb(main):052:0> string = string.encode('utf-8')
=> "Norrlandsvägen"
I am not sure why Norrlandsvägen in iso-8859-1 is converted into Norrlandsvägen in utf-8.
I have tried encode, encode!, encode(destinationEncoding, originalEncoding), iconv, force_encoding, and all kinds of weird work-arounds I could think of but nothing seems to work. Can someone please help me/point me in the right direction?
Ruby newbie still pulling hair like crazy but feeling grateful for all the replies here... :)
Background of this question: I am writing a gem that will download an xml file from some websites (which will have iso-8859-1 encoding) and save it in a storage and I would like to convert it to utf-8 first. But words like Norrlandsvägen keep messing me up. Really any help would be greatly appreciated!
[UPDATE]: I realized running tests like this in the irb console might give me different behaviors so here is what I have in my actual code:
def convert_encoding(string, originalEncoding)
  puts "#{string.encoding}" # ASCII-8BIT
  string.encode(originalEncoding)
  puts "#{string.encoding}" # still ASCII-8BIT
  string.encode!('utf-8')
end
but the last line gives me the following error:
Encoding::UndefinedConversionError - "\xC3" from ASCII-8BIT to UTF-8
Thanks to @Amadan's answer below, I noticed that \xC3 actually shows up in irb if you run:
irb(main):001:0> string = 'ä'
=> "ä"
irb(main):002:0> string.force_encoding('iso-8859-1')
=> "\xC3\xA4"
I have also tried to assign a new variable to the result of string.encode(originalEncoding) but got an even weirder error:
newString = string.encode(originalEncoding)
puts "#{newString.encoding}" # can't even get to this line...
newString.encode!('utf-8')
and the error is Encoding::UndefinedConversionError - "\xC3" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1
I am still quite lost in all of this encoding mess but I am really grateful for all the replies and help everyone has given me! Thanks a ton! :)
You assign a string, in UTF-8. It contains ä. UTF-8 represents ä with two bytes.
string = 'ä'
string.encoding
# => #<Encoding:UTF-8>
string.length
# 1
string.bytes
# [195, 164]
Then you force the bytes to be interpreted as if they were ISO-8859-1, without actually changing the underlying representation. This does not contain ä any more. It contains two characters, Ã and ¤.
string.force_encoding('iso-8859-1')
# => "\xC3\xA4"
string.length
# 2
string.bytes
# [195, 164]
Then you translate that into UTF-8. Since this is not reinterpretation but translation, you keep the two characters, but now encoded in UTF-8:
string = string.encode('utf-8')
# => "ä"
string.length
# 2
string.bytes
# [195, 131, 194, 164]
What you are missing is the fact that you originally don't have an ISO-8859-1 string, as you would from your Web-service - you have gibberish. Fortunately, this is all in your console tests; if you read the response of the website using the proper input encoding, it should all work okay.
For your console test, let's demonstrate that if you start with a proper ISO-8859-1 string, it all works:
string = 'Norrlandsvägen'.encode('iso-8859-1')
# => "Norrlandsv\xE4gen"
string = string.encode('utf-8')
# => "Norrlandsvägen"
EDIT For your specific problem, this should work:
require 'net/https'
uri = URI.parse("https://rusta.easycruit.com/intranet/careerbuilder_se/export/xml/full")
options = {
  :use_ssl => uri.scheme == 'https',
  :verify_mode => OpenSSL::SSL::VERIFY_NONE
}
response = Net::HTTP.start(uri.host, uri.port, options) do |https|
  https.request(Net::HTTP::Get.new(uri.path))
end
body = response.body.force_encoding('ISO-8859-1').encode('UTF-8')
There's a difference between force_encoding and encode. The former sets the encoding for the string, whereas the latter actually transcodes the contents of the string to the new encoding. Consequently, the following code causes your problem:
string = "Norrlandsvägen"
string.force_encoding('iso-8859-1')
puts string.encode('utf-8') # Norrlandsvägen
Whereas the following code will actually correctly encode your contents:
string = "Norrlandsvägen".encode('iso-8859-1')
string.encode!('utf-8')
Here's an example running in irb:
irb(main):023:0> string = "Norrlandsvägen".encode('iso-8859-1')
=> "Norrlandsv\xE4gen"
irb(main):024:0> string.encoding
=> #<Encoding:ISO-8859-1>
irb(main):025:0> string.encode!('utf-8')
=> "Norrlandsvägen"
irb(main):026:0> string.encoding
=> #<Encoding:UTF-8>
The above answer was spot on. Specifically this point here:
There's a difference between force_encoding and encode. The former
sets the encoding for the string, whereas the latter actually
transcodes the contents of the string to the new encoding.
In my situation, I had a text file with iso-8859-1 encoding. By default, Ruby uses UTF-8 encoding, so if you were to try to read the file without specifying the encoding, then you would get an error:
results = File.read(file)
results.encoding
=> #<Encoding:UTF-8>
results.split("\r\n")
ArgumentError: invalid byte sequence in UTF-8
You get an invalid byte sequence error because the characters in different encodings are represented by different byte lengths. Consequently, you would need to specify the encoding to the File API. Think of it like force_encoding:
results = File.read(file, encoding: "iso-8859-1")
So everything is good right? No, not if you want to start parsing the iso-8859-1 string with UTF-8 character encodings:
results = File.read(file, encoding: "iso-8859-1")
results.split("\r\n").each do |line|
  puts line.split('¬')
end
Encoding::CompatibilityError: incompatible character encodings: ISO-8859-1 and UTF-8
Why this error? Because '¬' is represented as UTF-8. You are using a UTF-8 character sequence against an ISO-8859-1 string; they are incompatible encodings. Consequently, after you read the file as ISO-8859-1, you can ask Ruby to encode that ISO-8859-1 string into UTF-8. Then you will be working with UTF-8 strings and thus have no problems:
results = File.read(file, encoding: "iso-8859-1").encode('UTF-8')
results.encoding
results = results.split("\r\n")
results.each do |line|
  puts line.split('¬')
end
Ultimately, with some Ruby APIs, you do not need to use force_encoding('ISO-8859-1'). Instead, you just specify the expected encoding to the API. However, you must convert it back to UTF-8 if you plan to parse it with UTF-8 strings.
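As a side note, Ruby's IO layer can also transcode while reading, by passing an "EXTERNAL:INTERNAL" encoding pair; this folds the separate encode step into the read itself (the file name below is just for illustration):

```ruby
# Write a small ISO-8859-1 file, then read it back already transcoded
# to UTF-8 in one step via the "external:internal" encoding syntax.
File.binwrite("latin1.txt", "Norrlandsvägen\n".encode("ISO-8859-1"))

results = File.read("latin1.txt", encoding: "ISO-8859-1:UTF-8")
puts results.encoding  # UTF-8
puts results           # Norrlandsvägen
```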

ruby 1.9 character conversion errors while testing regex

I know there are a tons of docs and debates out there, but still:
This is my best shot in my Rails attempt to test scraped data from various websites. The strange fact is that if I manually copy-paste the source of a URL, everything goes right.
What can I do?
# encoding: utf-8
require 'rubygems'
require 'iconv'
require 'nokogiri'
require 'open-uri'
require 'uri'
url = 'http://www.website.com/url/test'
sio = open(url)
@cur_encoding = sio.charset
doc = Nokogiri::HTML(sio, nil, @cur_encoding)
txtdoc = doc.to_s
# 1) String manipulation test
p doc.search('h1')[0].text # "Nove36  "
p doc.search('h1')[0].text.strip! # nil <- ERROR
# 2) Regex test
# txtdoc = "test test 44.00 € test test" # <- THIS WORKS
regex = "[0-9.]+ €"
p /#{regex}/i =~ txtdoc # integer expected
I realize that probably my OS Ubuntu plus my text editor is doing some good encoding conversion over probably some broken encoding: that's fine, BUT how can I fix this problem on my app while running live?
@cur_encoding = doc.encoding # ISO-8859-15
ISO-8859-15 is not the correct encoding for the quoted page; it should have been UTF-8. iconving it to UTF-8 as if it were 8859-15 only compounds the problem.
This encoding is coming from a faulty <meta> tag in the document. A browser will ignore that tag and use the overriding encoding from the Content-Type: text/html;charset=utf-8 HTTP response header.
However Nokogiri appears not to be able to read this header from the open()ed stream. With the caveat that I know nothing about Ruby, looking at the source the problem would seem to be that it uses the property encoding from the string-or-IO instead of charset which seems to be what open-uri writes.
You can pass in an override encoding of your own, so I guess try:
sio = open(url)
doc = Nokogiri::HTML.parse(sio, nil, sio.charset) # should be UTF-8?
The problems you're having are caused by non breaking space characters (Unicode U+00A0) in the page.
In your first problem, the string:
"Nove36 "
actually ends with U+00A0, and String#strip! doesn't consider this character to be whitespace to be removed:
1.9.3-p125 :001 > s = "Foo \u00a0"
=> "Foo  "
1.9.3-p125 :002 > s.strip
=> "Foo  " #unchanged
In your second problem, the space between the price and the euro sign is again a non breaking space, so the regex simply doesn't match as it is looking for a normal space:
# s as before
1.9.3-p125 :003 > s =~ /Foo / #2 spaces, no match
=> nil
1.9.3-p125 :004 > s =~ /Foo / #1 space, match
=> 0
1.9.3-p125 :005 > s =~ /Foo \u00a0/ #space and non breaking space, match
=> 0
When you copy and paste the source, the browser probably normalises the non breaking spaces, so you only copy normal space character, which is why it works that way.
The simplest fix would be to do a global substitution of \u00a0 for space before you start processing:
sio = open(url)
@cur_encoding = sio.charset
txt = sio.read           # read the whole file
txt.gsub! "\u00a0", " "  # global replace
doc = Nokogiri::HTML(txt, nil, @cur_encoding) # use this new string instead...

ruby 1.9: invalid byte sequence in UTF-8

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understood, the net/http library doesn't have any encoding specific options and the stuff that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...
In Ruby 1.9.3 it is possible to use String#encode to "ignore" the invalid UTF-8 sequences. Here is a snippet that will work both in 1.8 (Iconv) and 1.9 (String#encode):
require 'iconv' unless String.method_defined?(:encode)

if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
or if you have really troublesome input you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:
require 'iconv' unless String.method_defined?(:encode)

if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
Neither the accepted answer nor the other answer worked for me. I found this post, which suggested:
string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
This fixed the problem for me.
My current solution is to run:
my_string.unpack("C*").pack("U*")
This will at least get rid of the exceptions, which was my main problem.
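One caveat worth noting with this trick: unpack("C*") yields one integer per byte, and pack("U*") turns each integer into its own codepoint, so any valid multibyte UTF-8 already in the string gets mangled into mojibake:

```ruby
# "ä" is the two bytes C3 A4 in UTF-8. Unpacking per byte and repacking
# per codepoint produces U+00C3 followed by U+00A4 - two characters.
s = "ä"
p s.unpack("C*")               # [195, 164]
puts s.unpack("C*").pack("U*") # "Ã¤" - not "ä" any more
```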
Try this:
def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end
I recommend you use an HTML parser. Just find the fastest one.
Parsing HTML is not as easy as it may seem.
Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by just substituting the "�" symbol, so once the invalid UTF-8 sequence in the HTML gets parsed, the resulting text is a valid string.
Even inside attribute values you have to decode HTML entities like &amp;.
Here is a great question that sums up why you can not reliably parse HTML with a regular expression:
RegEx match open tags except XHTML self-contained tags
attachment = file.read
begin
  # Try it as UTF-8 directly
  cleaned = attachment.dup.force_encoding('UTF-8')
  unless cleaned.valid_encoding?
    # Some of it might be old Windows code page
    cleaned = attachment.encode('UTF-8', 'Windows-1252')
  end
  attachment = cleaned
rescue EncodingError
  # Force it to UTF-8, throwing out invalid bits
  attachment = attachment.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
end
This seems to work:
def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end
I've encountered a string that had a mix of English, Russian and some other alphabets, which caused an exception. I need only Russian and English, and this currently works for me:
ec1 = Encoding::Converter.new("UTF-8", "Windows-1251", :invalid => :replace, :undef => :replace, :replace => "")
ec2 = Encoding::Converter.new("Windows-1251", "UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
t = ec2.convert(ec1.convert(t))
While Nakilon's solution works, at least as far as getting past the error, in my case I had a weird messed-up character originating from Microsoft Excel converted to CSV that was registering in Ruby as a (get this) Cyrillic K, which in Ruby was a bolded K. To fix this I used 'iso-8859-1', viz. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky-deaky Cyrillic K's into a much more manageable /\xCA/, which I could then remove with string.gsub!(/\xCA/, '').
Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images which are not encoded in UTF-8. The page could also be non-html if you picked up a href in something like a <link> element. How to check this varies on what HTTP library you are using. Then, make sure the result is only ascii with String#ascii_only? (not UTF-8 because HTML is only supposed to be using ascii, entities can be used otherwise). If both of those tests pass, it is safe to use scan.
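A minimal sketch of that guard (the HTML snippet here is made up for illustration):

```ruby
# Only fall back to the fast regex scan once we know the body is
# plain ASCII; otherwise a proper parser is the safer route.
html = '<a href="/page?id=1">link</a> <a href="/about">about</a>'
if html.ascii_only?
  links = html.scan(/href="(.*?)"/i).flatten
  p links  # ["/page?id=1", "/about"]
end
```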
There is also the scrub method to filter invalid bytes.
string.scrub('')
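For example (String#scrub is available from Ruby 2.1 onwards):

```ruby
# The literal contains the invalid byte FF, so the string has to be
# tagged UTF-8 explicitly before scrubbing.
s = "abc\xFFdef".force_encoding("UTF-8")
puts s.valid_encoding?  # false
puts s.scrub("")        # "abcdef" - invalid bytes dropped
puts s.scrub            # "abc�def" - default replacement character
```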
If you don't "care" about the data you can just do something like:
search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"
I just used valid_encoding? to get past it. Mine is a search field, and I was finding the same weirdness over and over, so I used something like the above just to have the system not break. Since I don't control the user experience to auto-validate prior to sending this info (like auto feedback to say "dummy up!"), I can just take it in, strip it out and return blank results.
