incompatible character encodings: UTF-8 and ASCII-8BIT Ruby 1.9

I have just recently upgraded to Ruby 1.9.2 and one of my monkey patches is failing with an encoding error. I have the following method:
def strip_noise()
return if (!self) || (self.size == 0)
self.delete(160.chr+194.chr).gsub(/[,]/, "").strip
end
That now gives me the following error:
incompatible character encodings:
UTF-8 and ASCII-8BIT
Has anyone else come across this?

This is working for me at the moment anyway:
class String
def strip_noise()
return if empty?
self.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'')
end
end
I need to do more testing but I can progress..

class String
def strip_noise
return if empty?
ActiveSupport::Inflector.transliterate self, ''
end
end
"#{160.chr}#{197.chr} string with noises" # => "\xA0\xC5 string with noises"
"#{160.chr}#{197.chr} string with noises".strip_noise # => "A string with noises"

This might not be exactly what you want:
def strip_noise
return if empty?
sub = 160.chr.force_encoding(encoding) + 194.chr.force_encoding(encoding)
delete(sub).gsub(/[,]/, "").strip
end
Read more on the topic here: http://yehudakatz.com/2010/05/17/encodings-unabridged/

It's not entirely clear what you're trying to do here, but 160.chr+194.chr is not valid UTF-8: 160 is a continuation byte, and 194 is the first byte of a 2-byte character. Reversed (194, then 160) they form the UTF-8 encoding of the Unicode character U+00A0, "non-breaking space".
If you want to remove all non-ASCII (7-bit) characters, try this:
s.delete!("^\u{0000}-\u{007F}")
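To make the byte-order point concrete, here is a minimal sketch. The standalone strip_noise helper and the NBSP constant are hypothetical names of mine, not the asker's monkey patch. It shows that U+00A0 is the byte pair 194, 160 in UTF-8, and that deleting the character itself avoids building strings from raw bytes:

```ruby
# U+00A0 (non-breaking space) is the byte pair 0xC2 0xA0 (194, 160) in UTF-8.
NBSP = "\u00A0"

# Hypothetical standalone helper; takes the string as an argument
# instead of patching String.
def strip_noise(text)
  return text if text.nil? || text.empty?
  # Delete the NBSP character and commas, then trim ASCII whitespace.
  text.delete(NBSP).delete(",").strip
end

nbsp_bytes = NBSP.bytes
raw_byte_encoding = 160.chr.encoding   # Integer#chr above 127 yields a binary string
cleaned = strip_noise("\u00A01,234\u00A0 ")

puts nbsp_bytes.inspect
puts cleaned
```

Because NBSP is a real UTF-8 character here, delete never has to mix a binary string with a UTF-8 one, which is what triggers the incompatibility error.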

Related

Ruby UTF-8 string to UCS-2 conversion

I have a UTF-8 string in my Ruby code. Due to limitations I want to convert the UTF-8 characters in that string to either their escaped equivalents (such as \u23) or simply convert the whole string to UCS-2. I need to explicitly do this to export the data to a file
I tried to do the following in IRB:
my_string = '7.0mΩ'
my_string.encoding
my_string.encode!(Encoding::UCS_2BE)
my_string.encoding
The output of that is:
=> "7.0mΩ"
=> #<Encoding::UTF-8>
=> "7.0m\u2126"
=> #<Encoding::UTF-16BE>
This seemed to work fine (I got "ohm" as 2126) until I was reading data out of an array (in Rails):
data.each_with_index do |entry, idx|
puts "#{idx} !! #{entry['title']} !! #{entry['value']} !! #{entry['value'].encode!(Encoding::UCS_2BE)}"
end
That results in the error:
incompatible character encodings: UTF-8 and UTF-16BE
I then tried to write a basic file conversion routine:
File.open(target, 'w', encoding: Encoding::UCS_2BE) do |file|
File.open(source, 'r', encoding: Encoding::UTF_8).each_line do |line|
file.puts(line)
end
end
This resulted in all kinds of weird characters in the file.
Not sure what is going wrong.
Is there a better way to approach this problem of converting UTF-8 data to UCS-2 in Ruby? I really wouldn't mind this actually being changed in the string to \u2126 as a literal part of the string rather than the actual value.
Help!
Temporary Workaround
I monkey-patched this to do what I want. It's not very elegant, but it does the job (and yes, I know it's not pretty... it's just a hack to get what I need):
def hacky_encode
encoded = self
unless encoded.ascii_only?
encoded = scan(/./).map do |char|
char.ascii_only? ? char : char.unpack('U*').map { |i| '\\u' + i.to_s(16).rjust(4, '0') }
end.join
end
encoded
end
Which can be used:
"7.0mΩ".hacky_encode

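For what it's worth, a shorter sketch of the same two ideas, under my own assumptions: escape_non_ascii is a hypothetical helper name, and the file is written in binary mode from an already-encoded string. Using the non-bang encode leaves the original UTF-8 string untouched, so it can still be interpolated into other UTF-8 strings (interpolating a UTF-16BE string into a UTF-8 one is what raises the incompatibility error):

```ruby
require "tmpdir"

# Hypothetical helper: replace every non-ASCII character with a literal
# \uXXXX escape (characters outside the BMP would get more than 4 hex digits).
def escape_non_ascii(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format('\u%04x', char.ord) }
end

source = "7.0m\u2126"                        # OHM SIGN, U+2126
escaped = escape_non_ascii(source)
utf16 = source.encode(Encoding::UTF_16BE)    # returns a new string; source stays UTF-8

# Write the UTF-16BE bytes in binary mode so nothing is transcoded a second time.
written = Dir.mktmpdir do |dir|
  path = File.join(dir, "out.txt")
  File.binwrite(path, utf16)
  File.binread(path).bytes
end

puts escaped
```
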
How to create a string with a "bad encoding" in ruby?

I have a file out in production that I do not have access to. When a Ruby script loads it, a regular expression run against the contents fails with ArgumentError: invalid byte sequence in UTF-8.
I believe I have a fix based on the answer with all the points here: ruby 1.9: invalid byte sequence in UTF-8
# Remove all invalid and undefined characters in the given string
# (ruby 1.9.3)
def safe_str str
# edited based on matt's comment (thanks matt)
s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
s.encode!('utf-8', 'utf-16')
end
However, I now want to build my rspec to verify that the code works. I don't have access to the file that caused the problem so I want to create a string with the bad encoding programmatically.
I've tried variations on things like:
bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.length.should > safe_str(bad_str).length
or,
bad_str = (100..1000).to_a.pack('c*')
bad_str.length.should > safe_str(bad_str).length
but the length is always the same. I have also tried different character ranges; not always 100 to 1000.
Any suggestions on how to build a string with an invalid encoding within a ruby 1.9.3 script?
Lots of one-byte strings will make an invalid UTF-8 string, starting with 0x80. So 128.chr should work.
Your safe_str method will (currently) never actually do anything to the string, it is a no-op. The docs for String#encode on Ruby 1.9.3 say:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
This is true for the current release of 2.0.0 (patch level 247), however a recent commit to Ruby trunk changes this, and also introduces a scrub method that pretty much does what you want.
Until a new version of Ruby is released you will need to round trip your text string to another encoding and back to clean it, as in the second example in this answer to the question you linked to, something like:
def safe_str str
s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
s.encode!('utf-8', 'utf-16')
end
Note that your first example of an attempt to create an invalid string won’t work:
bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.valid_encoding? # => true
From the << docs:
If the object is a Integer, it is considered as a codepoint, and is converted to a character before concatenation.
So you’ll always get a valid string.
Your second method, using pack will create a string with the encoding ASCII-8BIT. If you then change this using force_encoding you can create a UTF-8 string with an invalid encoding:
bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
bad_str.valid_encoding? # => false
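As a sketch of how this could look once String#scrub is available (Ruby 2.1 and later), assuming you simply want invalid bytes dropped. The safe_str below is my own variant, not the asker's round-trip version:

```ruby
# Build a UTF-8 tagged string that ends in a stray continuation byte (0x80).
bad_str = [0x68, 0x69, 0x80].pack("C*").force_encoding("UTF-8")

# Hypothetical replacement for safe_str on Ruby 2.1+: drop invalid bytes.
def safe_str(str)
  str.scrub("")
end

clean = safe_str(bad_str)

puts bad_str.valid_encoding?
puts clean
```

A spec can then compare bytesize (or length) before and after, and check valid_encoding? on the result.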
Try with s = "hi \255"
s.valid_encoding?
# => false
The following example can be used for testing purposes:
describe TestClass do
let(:non_utf8_text) { "something\255 english." }
it 'does not raise an error on an invalid byte sequence string' do
expect(non_utf8_text).not_to be_valid_encoding
expect { subject.call(non_utf8_text) }.not_to raise_error
end
end
Thanks to Iwan B. for the "\255" advice.
In spec tests I’ve written, I haven’t found a way to fix this bad encoding:
Period%Basics
The %B string consistently produces ArgumentError: invalid byte sequence in UTF-8.

Ruby `split': invalid byte sequence in UTF-8 (ArgumentError)

I am trying to populate the movie object, but when parsing through the u.item file I get this error:
`split': invalid byte sequence in UTF-8 (ArgumentError)
File.open("Data/u.item", "r") do |infile|
while line = infile.gets
line = line.split("|")
end
end
The error occurs only when trying to split the lines with fancy international punctuation.
Here's a sample
543|Misérables, Les (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Mis%E9rables%2C%20Les%20%281995%29|0|0|0|0|0|0|0|0|1|0|0|0|1|0|0|0|0|0|0
Is there a work around??
I had to force the encoding of each line to iso-8859-1
(which is the European character set)... http://en.wikipedia.org/wiki/ISO/IEC_8859-1
a=[]
IO.foreach("u.item") {|x| a << x}
m=[]
a.each_with_index {|line,i| x=line.force_encoding("iso-8859-1").split("|"); m[i]=x}
Ruby is somewhat sensitive to character encoding issues. You can do a number of things that might solve your problem. For example:
Put an encoding comment at the top of your source file.
# encoding: utf-8
Explicitly encode your line before splitting.
line = line.encode('UTF-8').split("|")
Replace invalid characters, instead of raising an Encoding::InvalidByteSequenceError exception.
line.encode('UTF-8', :invalid => :replace).split("|")
Give these suggestions a shot, and update your question if none of them work for you. Hope it helps!
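If the file really is ISO-8859-1, as the other answer suggests (that is an assumption about the data, not something Ruby can detect), another option is to declare the encoding when opening the file and let Ruby transcode to UTF-8 while reading. A minimal sketch using a temporary file in place of Data/u.item:

```ruby
require "tempfile"

# Write a sample line as raw ISO-8859-1 bytes (0xE9 is "é" in that encoding).
sample = "543|Mis\xE9rables, Les (1995)|01-Jan-1995\n".b

rows = Tempfile.create("u_item") do |tmp|
  tmp.binmode
  tmp.write(sample)
  tmp.flush
  # "r:ISO-8859-1:UTF-8" = read as ISO-8859-1, transcode to UTF-8.
  File.open(tmp.path, "r:ISO-8859-1:UTF-8") do |infile|
    infile.each_line.map { |line| line.chomp.split("|") }
  end
end

puts rows.first[1]
```

Unlike force_encoding, this gives you UTF-8 strings, so later comparisons and output behave like the rest of your program's text.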

URI.unescape crashes as it is trying to convert "%C3%9Fą" to "ßą"

I am using URI.unescape to unescape text; unfortunately I run into a weird error:
# encoding: utf-8
require('uri')
URI.unescape("%C3%9Fą")
results in
C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `gsub': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:331:in `unescape'
from C:/Ruby193/lib/ruby/1.9.1/uri/common.rb:649:in `unescape'
from exe/fail.rb:3:in `<main>'
why?
I don't know why, but you can use the CGI.unescape method instead:
# encoding: utf-8
require 'cgi'
CGI.unescape("%C3%9Fą")
The implementation of URI.unescape is broken for non-ASCII inputs. The 1.9.3 version looks like this:
def unescape(str, escaped = @regexp[:ESCAPED])
str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(str.encoding)
end
The regex in use is /%[a-fA-F\d]{2}/. So it goes through the string looking for a percent sign followed by two hex digits; in the block $& will be the matched text ('%C3' for example) and $&[1,2] be the matched text without the leading percent sign ('C3'). Then we call String#hex to convert that hexadecimal number to a Fixnum (195) and wrap it in an Array ([195]) so that we can use Array#pack to do the byte mangling for us. The problem is that pack gives us a single binary byte:
> puts [195].pack('C').encoding
ASCII-8BIT
The ASCII-8BIT encoding is also known as "binary" (i.e. plain bytes with no particular encoding). The block returns that byte, String#gsub tries to insert it into the UTF-8 encoded copy of str that it is working on, and you get your error:
incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
because you can't (in general) just stuff binary bytes into a UTF-8 string; you can often get away with it:
URI.unescape("%C3%9F") # Works
URI.unescape("%C3µ") # Fails
URI.unescape("µ") # Works, but nothing to gsub here
URI.unescape("%C3%9Fµ") # Fails
URI.unescape("%C3%9Fpancakes") # Works
Things start falling apart once you start mixing non-ASCII data into your URL encoded string.
One simple fix is to switch the string to binary before trying to decode it:
def unescape(str, escaped = @regexp[:ESCAPED])
encoding = str.encoding
str = str.dup.force_encoding('binary')
str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
end
Another option is to push the force_encoding into the block:
def unescape(str, escaped = @regexp[:ESCAPED])
str.gsub(escaped) { [$&[1, 2].hex].pack('C').force_encoding(str.encoding) }
end
I'm not sure why the gsub fails in some cases but succeeds in others.
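A likely explanation, sketched here with made-up variable names: Ruby treats two strings with different encodings as compatible when at least one of them contains only 7-bit ASCII, so the binary byte can be combined with "pancakes" but not with "µ". Encoding.compatible? shows the rule directly:

```ruby
binary_byte = [195].pack("C")   # one raw byte, tagged ASCII-8BIT (binary)
ascii_text = "pancakes"         # UTF-8, but only 7-bit characters
non_ascii_text = "\u00B5"       # UTF-8 with a multibyte character ("µ")

# Returns the encoding the combination would have, or nil if incompatible.
ok = Encoding.compatible?(ascii_text, binary_byte)
clash = Encoding.compatible?(non_ascii_text, binary_byte)

joined = ascii_text + binary_byte   # allowed; the result is binary

error = begin
  non_ascii_text + binary_byte      # raises: both sides have non-ASCII content
  nil
rescue Encoding::CompatibilityError => e
  e
end

puts clash.inspect
puts error.class
```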
To expand on Vasiliy's answer that suggests using CGI.unescape:
As of Ruby 2.5.0, URI.unescape is obsolete.
See https://ruby-doc.org/stdlib-2.5.0/libdoc/uri/rdoc/URI/Escape.html#method-i-unescape.
"This method is obsolete and should not be used. Instead, use CGI.unescape, URI.decode_www_form or URI.decode_www_form_component depending on your specific use case."

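As a rough sketch of the replacement mentioned in that note, URI.decode_www_form_component handles the mixed input from the question. Note that it also turns "+" into a space, which may or may not suit your use case. The unescape_bytes helper is a hypothetical hand-rolled variant of the "decode on a binary copy" idea, not library code:

```ruby
require "uri"

input = "%C3%9F\u0105"   # percent-escaped "ß" followed by a literal "ą"

# Decodes %XX escapes and also turns "+" into a space (form encoding rules).
decoded = URI.decode_www_form_component(input)

# Hypothetical hand-rolled variant: decode on a binary copy, then restore the tag.
def unescape_bytes(str)
  str.b.gsub(/%\h\h/) { |escape| escape[1, 2].hex.chr }.force_encoding(str.encoding)
end

manual = unescape_bytes(input)

puts decoded
puts manual
```
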
Is there a way to decode q-encoded strings in Ruby?

I'm working with mails, and names and subjects sometimes come q-encoded, like this:
=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?=
Is there a way to decode them in Ruby? It seems TMail should take care of it, but it's not doing it.
I use this to parse email subjects:
You could try the following:
str = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
if m = /=\?([A-Za-z0-9\-]+)\?(B|Q)\?([!->@-~]+)\?=/i.match(str)
case m[2]
when "B" # Base64 encoded
decoded = Base64.decode64(m[3])
when "Q" # Q encoded
decoded = m[3].unpack("M").first.gsub('_',' ')
else
p "Could not find keyword!!!"
end
Iconv.conv('utf-8',m[1],decoded) # to convert to utf-8
end
Ruby includes a method of decoding Quoted-Printable strings:
puts "Pablo_Fern=C3=A1ndez".unpack "M"
# => Pablo_Fernández
But this doesn't seem to work on your entire string (including the =?UTF-8?Q? part at the beginning). Maybe you can work it out from there, though.
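One way to work it out from there, as a minimal sketch: peel off the =?charset?Q?...?= wrapper with a regular expression, then unpack only the text part. The decode_q_word name is hypothetical, and this handles a single Q-encoded word only (no Base64, no multiple words):

```ruby
# Hypothetical helper: decode one "=?charset?Q?text?=" encoded-word to UTF-8.
def decode_q_word(word)
  match = /\A=\?([^?]+)\?[Qq]\?([^?]*)\?=\z/.match(word)
  return word unless match   # leave anything else untouched

  charset, text = match.captures
  # In Q-encoding "_" stands for a space, so map it to =20 before unpacking.
  bytes = text.gsub("_", "=20").unpack("M").first
  # unpack returns raw bytes; tag them with the declared charset, then convert.
  bytes.force_encoding(charset).encode("UTF-8")
end

decoded = decode_q_word("=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?=")
puts decoded
```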
This is a pretty old question but TMail::Unquoter (or its new incarnation Mail::Encodings) does the job as well.
TMail::Unquoter.unquote_and_convert_to(str, 'utf-8' )
or
Mail::Encodings.unquote_and_convert_to( str, 'utf-8' )
Decoding on a line-per-line basis:
line.unpack("M")
Convert STDIN or file provided input of encoded strings into a decoded output:
if ARGV[0]
lines = File.read(ARGV[0]).lines
else
lines = STDIN.each_line.to_a
end
puts lines.map { |c| c.unpack("M") }.join
This might help anyone wanting to test an email. delivery.html_part is normally encoded, but can be decoded to a straight HTML body using .decoded.
test "email test" do
UserMailer.confirm_email(user).deliver_now
assert_equal 1, ActionMailer::Base.deliveries.size
delivery = ActionMailer::Base.deliveries.last
assert_equal "Please confirm your email", delivery.subject
assert delivery.html_part.decoded =~ /Click the link below to confirm your email/ # DECODING HERE
end
The most efficient and up-to-date solution seems to be the value_decode method of the Mail gem.
> Mail::Encodings.value_decode("=?UTF-8?Q?Greg_of_Google?=")
=> "Greg of Google"
https://www.rubydoc.info/github/mikel/mail/Mail/Encodings#value_decode-class_method
Below is Ruby code you can cut-and-paste, if inclined. It will run tests if executed directly with ruby, ruby ./copy-pasted.rb. As done in the code, I use this module as a refinement to the String core class.
A few remarks on the solution:
Other solutions perform .gsub('_', ' ') on the unpacked string. However, I do not believe this is correct, and it can result in an incorrect decoding depending on the charset. RFC2047 Section 4.2 (2) indicates "_ always represents hexadecimal 20", so it seems correct to first substitute =20 for _ and then rely on the unpack result. (This also makes the implementation more elegant.) This is also discussed in an answer to a related question.
To be more instructive, I have written the regular expression in free-spacing mode to allow comments (I find this generally helpful for complex regular expressions). If you adjust the regular expression, take note that free-spacing mode changes the matching of white-space, which must then be done escaped or as a character class (as in the code). I've also added the regular expression on regex101, so you can read an explanation of the named capture groups, lazy quantifiers, etc. and experiment yourself.
The regular expression will absorb space ( ; but not TAB or newline) between multiple Q-encoded phrases in a single string, as shown with string test_4. This is because RFC2047 Section 5 (1) indicates that multiple Q encoded phrases must be separated from each other by linear white-space. Depending on your use-case, absorbing the white-space may not be desired.
The regular expression's named capture for code permits unexpected quoted printable codes (other than [bBqQ]) so that a match will occur and the code can raise an error. This helps me to detect unexpected values when processing text. Change the named capture for code to [bBqQ] if you do not want this behaviour. (There will be no match and the original string will be returned.)
It makes use of the global Regexp.last_match as a convenience in the gsub block. You may need to take care if using this in multi-threaded code; I have not given this any consideration.
Additional references and reading:
https://en.wikipedia.org/wiki/Quoted-printable
https://en.wikipedia.org/wiki/MIME#Encoded-Word
require "minitest/autorun"
module QuotedPrintableDecode
class UnhandledCodeError < StandardError
def initialize(code)
super("Unhandled quoted printable code: '#{code}'.")
end
end
@@qp_text_regex = %r{
=\? # Opening literal: `=?`
(?<charset>[^\?]+) # Character set, e.g. "Windows-1252" in `=?Windows-1252?`
\? # Literal: `?`
(?<code>[a-zA-Z]) # Encoding, e.g. "Q" in `?Q?` (`B`ase64); [BbQq] expected, others raise
\? # Literal: `?`
(?<text>[^\?]+?) # Encoded text, lazy (non-greedy) matched, e.g. "Foo_bar" in `?Foo_bar?`
\?= # Closing literal: `?=`
(?:[ ]+(?==\?))? # Optional separating linear whitespace if another Q-encode follows
}x # Free-spacing mode to allow above comments, also changes whitespace match
refine String do
def decode_q_p(to: "UTF-8")
self.gsub(@@qp_text_regex) do
code, from, text = Regexp.last_match.values_at(:code, :charset, :text)
q_p_charset_to_charset(code, text, from, to)
end
end
private
def q_p_charset_to_charset(code, text, from, to)
case code
when "q", "Q"
text.gsub("_", "=20").unpack("M")
when "b", "B"
text.unpack("m")
else
raise UnhandledCodeError.new(code)
end.first.encode(to, from)
end
end
end
class TestQPDecode < Minitest::Test
using QuotedPrintableDecode
def test_decode_single_utf_8_phrase
encoded = "=?UTF-8?Q?J=2E_Pablo_Fern=C3=A1ndez?="
assert_equal encoded.decode_q_p, "J. Pablo Fernández"
end
def test_decoding_preserves_space_between_unencoded_phrase
encoded = "=?utf-8?Q?Alfred_Sanford?= <me@example.com>"
assert_equal encoded.decode_q_p, "Alfred Sanford <me@example.com>"
end
def test_decoding_multiple_adjacent_phrases_absorbs_separating_whitespace
encoded = "=?Windows-1252?Q?Foo_-_D?= =?Windows-1252?Q?ocument_World=9617=96520;_Recor?= =?Windows-1252?Q?d_People_to_C?= =?Windows-1252?Q?anada's_History?="
assert_equal encoded.decode_q_p, "Foo - Document World–17–520; Record People to Canada's History"
end
def test_decoding_string_without_encoded_phrases_preserves_original
encoded = "Contains no QP phrases"
assert_equal encoded.decode_q_p, encoded
end
def test_unhandled_code_raises
klass = QuotedPrintableDecode::UnhandledCodeError
message = "Unhandled quoted printable code: 'Z'."
encoded = "=?utf-8?Z?Unhandled code Z?="
raised_error = assert_raises(klass) { encoded.decode_q_p }
assert_equal message, raised_error.message
end
end

Resources