ruby 1.9, force_encoding, but check

I have a string I have read from some kind of input.
To the best of my knowledge, it is UTF8. Okay:
string.force_encoding("utf8")
But if this string has bytes in it that are not in fact legal UTF8, I want to know now and take action.
Ordinarily, will force_encoding("utf8") raise if it encounters such bytes? I believe it will not.
If I was doing an #encode I could choose from the handy options with what to do with characters that are invalid in the source encoding (or destination encoding).
But I'm not doing an #encode, I'm doing a #force_encoding. It has no such options.
Would it make sense to
string.force_encoding("utf8").encode("utf8")
to get an exception right away? Normally encoding from utf8 to utf8 doesn't make any sense. But maybe this is the way to get it to raise right away if there's invalid bytes? Or use the :replace option etc to do something different with invalid bytes?
But no, can't seem to make that work either.
Anyone know?
1.9.3-p0 :032 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
=> "bad: \xC3( okay"
1.9.3-p0 :033 > a.valid_encoding?
=> false
Okay, but how do I find and eliminate those bad bytes? Oddly, this does NOT raise:
1.9.3-p0 :035 > a.encode("utf-8")
=> "bad: \xC3( okay"
If I was converting to a different encoding, it would!
1.9.3-p0 :039 > a.encode("ISO-8859-1")
Encoding::InvalidByteSequenceError: "\xC3" followed by "(" on UTF-8
Or if I told it to, it'd replace it with a "?" =>
1.9.3-p0 :040 > a.encode("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"
So ruby's got the smarts to know what are bad bytes in utf-8, and to replace em with something else -- when converting to a different encoding. But I don't want to convert to a different encoding, i want to stay utf8 -- but I might want to raise if there's an invalid byte in there, or I might want to replace invalid bytes with replacement chars.
Isn't there some way to get ruby to do this?
Update: I believe this was finally added in Ruby 2.1; String#scrub, present in the 2.1 preview release, does this. So look for that!

(update: see https://github.com/jrochkind/scrub_rb)
So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb
But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":
a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"
Yep, that's exactly what I wanted. So turns out this IS built into 1.9 stdlib, it's just undocumented and few people know it (or maybe few people that speak English know it?). Although I saw these arguments used this way on a blog somewhere, so someone else knew it!
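One caveat worth checking before relying on this: with 'binary' as the source encoding, every byte above 0x7F counts as undefined, so valid multi-byte UTF-8 characters get replaced too, not only the bad bytes. A minimal sketch (the sample string is made up for illustration):

```ruby
# Valid UTF-8: "caf" followed by U+00E9, which is two bytes (0xC3 0xA9).
valid = "caf\u00E9"

# Treating the bytes as binary makes each non-ASCII byte "undefined",
# so both bytes of the accented character are replaced with U+FFFD.
via_binary = valid.encode("utf-8", "binary", :undef => :replace)

p via_binary == valid          # => false
p via_binary.length            # => 5
p via_binary.valid_encoding?   # => true
```

So the trick seems to do what is wanted only when the valid part of the string is plain ASCII.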

In ruby 2.1, the stdlib finally supports this with scrub.
http://ruby-doc.org/core-2.1.0/String.html#method-i-scrub
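A minimal sketch of the "raise or replace" behaviour asked for in the question, built on String#valid_encoding? and String#scrub. The helper name ensure_valid_utf8 and its replacement parameter are made up for this example, and it assumes Ruby 2.1 or later:

```ruby
# Hypothetical helper, not part of Ruby: tag the bytes as UTF-8, then either
# raise on invalid bytes or replace them when a replacement string is given.
def ensure_valid_utf8(str, replacement = nil)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?
  raise ArgumentError, "invalid byte sequence in UTF-8" if replacement.nil?
  utf8.scrub(replacement)
end

bad = "bad: \xC3\x28 okay"
p ensure_valid_utf8(bad, "?")   # => "bad: ?( okay"

begin
  ensure_valid_utf8(bad)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```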

make sure that your scriptfile itself is saved as UTF8 and try the following
# encoding: UTF-8
p [a = "bad: \xc3\x28 okay", a.valid_encoding?]
p [a.force_encoding("utf-8"), a.valid_encoding?]
p [a.encode!("ISO-8859-1", :invalid => :replace), a.valid_encoding?]
This gives on my windows7 system the following
["bad: \xC3( okay", false]
["bad: \xC3( okay", false]
["bad: ?( okay", true]
So your bad char is replaced, you can do it right away as follows
a = "bad: \xc3\x28 okay".encode!("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"
EDIT: here a solution that works on any arbitrary encoding, the first encodes only the bad chars, the second just replaces by a ?
def validate_encoding(str)
str.chars.collect do |c|
(c.valid_encoding?) ? c:c.encode!(Encoding.locale_charmap, :invalid => :replace)
end.join
end
def validate_encoding2(str)
str.chars.collect do |c|
(c.valid_encoding?) ? c:'?'
end.join
end
a = "bad: \xc3\x28 okay"
puts validate_encoding(a) #=>bad: ?( okay
puts validate_encoding(a).valid_encoding? #=>true
puts validate_encoding2(a) #=>bad: ?( okay
puts validate_encoding2(a).valid_encoding? #=>true

To check that a string has no invalid sequences, try to convert it to the binary encoding:
# Returns true if the string has only valid sequences
def valid_encoding?(string)
string.encode('binary', :undef => :replace)
true
rescue Encoding::InvalidByteSequenceError => e
false
end
p valid_encoding?("\xc0".force_encoding('iso-8859-1')) # true
p valid_encoding?("\u1111") # true
p valid_encoding?("\xc0".force_encoding('utf-8')) # false
This code replaces undefined characters, because we don't care if there are valid sequences that cannot be represented in binary. We only care if there are invalid sequences.
A slight modification to this code returns the actual error, which has valuable information about the improper encoding:
# Returns the encoding error, or nil if there isn't one.
def encoding_error(string)
string.encode('binary', :undef => :replace)
nil
rescue Encoding::InvalidByteSequenceError => e
e.to_s
end
# Returns truthy if the string has only valid sequences
def valid_encoding?(string)
!encoding_error(string)
end
puts encoding_error("\xc0".force_encoding('iso-8859-1')) # nil
puts encoding_error("\u1111") # nil
puts encoding_error("\xc0".force_encoding('utf-8')) # "\xC0" on UTF-8

About the only thing I can think of is to transcode to something and back that won't damage the string in the round-trip:
string.force_encoding("UTF-8").encode("UTF-32LE").encode("UTF-8")
Seems rather wasteful, though.
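A short sketch of how that round trip behaves, using the sample string from the question. The strict form raises on the first hop; passing :invalid => :replace on that hop replaces the bad byte instead:

```ruby
bad = "bad: \xC3\x28 okay".force_encoding("UTF-8")

# Strict: converting to UTF-32LE raises because of the invalid byte.
begin
  bad.encode("UTF-32LE").encode("UTF-8")
rescue Encoding::InvalidByteSequenceError => e
  puts "rejected: #{e.message}"
end

# Lenient: replace invalid bytes during the first hop, then convert back.
cleaned = bad.encode("UTF-32LE", :invalid => :replace).encode("UTF-8")
p cleaned.valid_encoding?   # => true
```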

Okay, here's a really lame pure ruby way to do it I figured out myself. It probably performs for crap. what the heck, ruby? Not selecting my own answer for now, hoping someone else will show up and give us something better.
# Pass in a string, will raise an Encoding::InvalidByteSequenceError
# if it contains an invalid byte for its encoding; otherwise
# returns an equivalent string.
#
# OR, like String#encode, pass in option `:invalid => :replace`
# to replace invalid bytes with a replacement string in the
# returned string. Pass in the
# char you'd like with option `:replace`, or will, like String#encode
# use the unicode replacement char if it thinks it's a unicode encoding,
# else ascii '?'.
#
# in any case, method will raise, or return a new string
# that is #valid_encoding?
def validate_encoding(str, options = {})
  str.chars.collect do |c|
    if c.valid_encoding?
      c
    else
      unless options[:invalid] == :replace
        # it ought to be filled out with all the metadata
        # this exception usually has, but what a pain!
        raise Encoding::InvalidByteSequenceError.new
      else
        options[:replace] || (
          # surely there's a better way to tell if
          # an encoding is a 'Unicode encoding form'
          # than this? What's wrong with you ruby 1.9?
          str.encoding.name.start_with?('UTF') ?
            "\uFFFD" :
            "?" )
      end
    end
  end.join
end
More ranting at http://bibwild.wordpress.com/2012/04/17/checkingfixing-bad-bytes-in-ruby-1-9-char-encoding/

If you are doing this for a "real-life" use case, for example parsing different strings entered by users, and not just for the sake of being able to "decode" a totally random file which could be made of as many encodings as you wish, then I guess you could at least assume that all characters in each string have the same encoding.
Then, in this case, what would you think about this?
strings = [ "UTF-8 string with some utf8 chars \xC3\xB2 \xC3\x93",
"ISO-8859-1 string with some iso-8859-1 chars \xE0 \xE8", "..." ]
strings.each { |s|
s.force_encoding "utf-8"
if s.valid_encoding?
next
else
while s.valid_encoding? == false
s.force_encoding "ISO-8859-1"
s.force_encoding "..."
end
s.encode!("utf-8")
end
}
I am not a Ruby "pro" in any way, so please forgive if my solution is wrong or even a bit naive..
I just try to give back what I can, and this is what I've come to, while I was (I still am) working on this little parser for arbitrarily encoded strings, which I am doing for a study-project.
While I'm posting this, I must admit that I've not even fully tested it.. I.. just got a couple of "positive" results, but I felt so excited of possibly having found what I was struggling to find (and for all the time I spent reading about this on SO..) that I just felt the need to share it as quick as possible, hoping that it could help save some time to anyone who has been looking for this for as long as I've been... .. if it works as expected :)
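The idea in this answer can be sketched as a loop over candidate encodings that keeps the first one under which the bytes are valid. The helper name and the candidate list below are assumptions made for illustration. The order matters, because ISO-8859-1 accepts any byte sequence and so always has to come last:

```ruby
# Candidate encodings, tried in order (an assumed list; adjust as needed).
CANDIDATE_ENCODINGS = [Encoding::UTF_8, Encoding::ISO_8859_1].freeze

# Hypothetical helper: returns a UTF-8 copy of str, interpreting its bytes
# with the first candidate encoding under which they are valid.
def guess_and_convert_to_utf8(str, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |encoding|
    attempt = str.dup.force_encoding(encoding)
    return attempt.encode(Encoding::UTF_8) if attempt.valid_encoding?
  end
  raise ArgumentError, "bytes are not valid in any candidate encoding"
end

strings = ["utf8 chars \xC3\xB2 \xC3\x93", "iso-8859-1 chars \xE0 \xE8"]
strings.each { |s| p guess_and_convert_to_utf8(s) }
```

This is only a guess at the real encoding: bytes that happen to be valid UTF-8 are always treated as UTF-8, even if the sender meant something else.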

A simple way to provoke an exception seems to be:
untrusted_string.match /./

Here are 2 common situations and how to deal with them in Ruby 2.1+. I know, the question refers to Ruby v1.9, but maybe this is helpful for others finding this question via Google.
Situation 1
You have an UTF-8 string with possibly a few invalid bytes
Remove the invalid bytes:
str = "Partly valid\xE4 UTF-8 encoding: äöüß"
str.scrub('')
# => "Partly valid UTF-8 encoding: äöüß"
Situation 2
You have a string that could be in either UTF-8 or ISO-8859-1 encoding
Check which encoding it is and convert to UTF-8 (if necessary):
str = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
unless str.valid_encoding?
str.encode!( 'UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '?' )
end #unless
# => "String in ISO-8859-1 encoding: äöüß"
Notes
The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Even though this is almost always the case, you can make sure of it by starting your scripts with # encoding: UTF-8.
Invalid byte sequences can be detected programmatically in most multi-byte encodings like UTF-8 (in Ruby, see: #valid_encoding?). However, it is NOT (easily) possible to programmatically detect invalidity in single-byte encodings like ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. detecting whether a String is valid ISO-8859-1.
Even though UTF-8 has become increasingly popular as the default encoding on the web, ISO-8859-1 and other Latin1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to, but vary slightly from, ISO-8859-1. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15
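The note about single-byte encodings can be checked directly. This small sketch builds a string holding every possible byte value and asks Ruby whether it is valid under each encoding:

```ruby
# Every possible byte value, as a binary string.
all_bytes = (0..255).to_a.pack("C*")

# ISO-8859-1 assigns a character to every byte, so nothing is ever invalid.
p all_bytes.dup.force_encoding("ISO-8859-1").valid_encoding?   # => true

# The same bytes are not a well-formed UTF-8 sequence.
p all_bytes.dup.force_encoding("UTF-8").valid_encoding?        # => false
```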

Related

String replace with strange characters when attributed to a HASH

I'm testing 2 situations and getting 2 strangely different results.
First:
hash_data_file = CSV.parse(data_file).map {|line|
puts line[6]
abort
The return is Caixa Econômica Federal with accents in the right place.
Second:
hash_data_file = CSV.parse(data_file).map {|line|
puts :bank => line[6]
abort
But the return is {:bank=>"Caixa Econ\xC3\xB4mica Federal"}, a string with errors in the codification instead of the accents.
What am I doing wrong?

In the first case, your data_file is in UTF-8 encoding. In the second case, data_file has binary (i.e. ASCII-8BIT) encoding.
For example, if we start with a simple UTF-8 CSV file:
bank
Caixa Econômica Federal
and then parse it with UTF-8 encoding:
CSV.parse(File.open('pancakes.csv', encoding: 'utf-8'))
# [["bank"], ["Caixa Econômica Federal"]]
and then in binary encoding:
CSV.parse(File.open('pancakes.csv', encoding: 'binary'))
# [["bank"], ["Caixa Econ\xC3\xB4mica Federal"]]
So you need to fix the encoding by reading the file in the proper encoding. Hard to say more since we don't know how data_file is being opened.
Have a look at
line[6].encoding
and you should see #<Encoding:UTF-8> in the first case but #<Encoding:ASCII-8BIT> in the second.
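If the bytes are known to be UTF-8 and reopening the file with the right encoding is not an option, retagging the string is one possible repair. A minimal sketch with an in-memory string standing in for the CSV field (String#b needs Ruby 2.0+):

```ruby
# The same bytes as the CSV field, tagged as binary (as if read in binary mode).
raw = "Caixa Econ\xC3\xB4mica Federal".b
p raw                    # inspect shows escapes: "Caixa Econ\xC3\xB4mica Federal"

# force_encoding only changes the label, not the bytes.
fixed = raw.dup.force_encoding("UTF-8")
p fixed.valid_encoding?  # => true
p fixed.length           # => 23 (characters), while raw.length is 24 (bytes)
```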

There is no “error in codification.”
"Caixa Econ\xC3\xB4mica Federal" == "Caixa Econômica Federal"
#⇒ true
For some reason when printing out a hash, ruby uses this representation (I cannot reproduce it though,) but in a nutshell the string you see is good enough.

UTF-8 ruby encoding

I've got this string: WinterIDäSchwiiz, which comes from an API and I want to search for it in the database. Now it turns out that this string has a different encoding than how its saved in my database. Yet ruby says the encoding for both is utf-8. What is going on?
I've figured out the most terrible way to fix this problem by going down to the bytesequence and replace the bytes representing the "ä" with a different bytesequence and then forceencoding it to utf8. It works but hurts my eyes. Does anyone have a better solution than:
"WinterIDäSchwiiz".bytes.join(",").gsub("97,204,136","195,164").split(",").collect{|s| s.to_i}.pack('C*').force_encoding('utf-8')

Your string is UTF-8.
I can tell because your fix is to replace the bytes (97, 204, 136) with the bytes (195, 164).
The first byte you're replacing, 97 (0x61) is the UTF-8 character a. The second two bytes, 204 and 136 (0xCC 0x88), are the bytes for the UTF-8 character U+0308, the combining diaeresis: ̈. The two characters combine to form ä.
The bytes you're expecting are 195 and 164 (0xC3 0xA4) which, together, are U+00E4, or Latin small letter "a" with diaeresis.
Both are UTF-8. One prints ä and the other prints ä. This is an example of Unicode equivalence.
In other words:
str1 = "a\xCC\x88"
puts str1 # => ä
p str1.bytes # => [97, 204, 136]
p str1.encoding # => #<Encoding:UTF-8>
str2 = "\xC3\xA4"
puts str2 # => ä
p str2.bytes # => [195, 164]
p str2.encoding # => #<Encoding:UTF-8>
Fortunately, we have Unicode normalization to help deal with this. This is a big topic, but the very, very insufficient TL;DR is that the Unicode consortium has prescribed standard ways to normalize strings like the above, i.e. how to turn str1 into str2.
Unfortunately, it's impossible to say what the best solution for you is, since you didn't provide any details. Your database might have built-in normalization functionality, but I don't know what database you're using so I can't say. Since you did mention Ruby I can point you to the String#unicode_normalize method, which was introduced in Ruby's standard library in Ruby 2.2:
str1 = "a\xCC\x88"
str2 = "\xC3\xA4"
p str1 == str2 # => false
str1_normalized = str1.unicode_normalize
p str1_normalized == str2
# => true
p str1_normalized.bytes == str2.bytes
# => true
If you don't have Ruby 2.2+, well... upgrade. But if you can't upgrade for some reason you can use ActiveSupport::Multibyte::Unicode.normalize, which is especially convenient if you're using Rails, or the Unicode gem.
One more thing
You don't need to do this, since the above is the correct way to do Unicode normalization in Ruby, but a much easier way to do this:
"WinterIDäSchwiiz".bytes.join(",").gsub("97,204,136","195,164").split(",").collect{|s| s.to_i }.pack('C*').force_encoding('utf-8')
...would have been this:
"WinterIDäSchwiiz".gsub("a\xCC\x88", "\xC3\xA4")
Any time you see something like join(",")...split(",") in Ruby it's almost certainly the wrong solution.
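For the database lookup in the question, one option is to normalize both sides to the same form before comparing or storing. A small sketch, assuming Ruby 2.2+; the helper name same_text? is made up for this example:

```ruby
# Hypothetical helper: compare two strings after normalizing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

from_api = "WinterIDa\u0308Schwiiz"   # "a" + combining diaeresis (decomposed)
from_db  = "WinterID\u00E4Schwiiz"    # precomposed U+00E4

p from_api == from_db             # => false
p same_text?(from_api, from_db)   # => true
```

Normalizing once when the data is written, and again on every search term, keeps the comparison itself a plain equality check.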

How to create a string with a "bad encoding" in ruby?

I have a file somewhere out in production that I do not have access to. When it is loaded by a ruby script, a regular expression against the contents fails with an ArgumentError => invalid byte sequence in UTF-8.
I believe I have a fix based on the answer with all the points here: ruby 1.9: invalid byte sequence in UTF-8
# Remove all invalid and undefined characters in the given string
# (ruby 1.9.3)
def safe_str str
# edited based on matt's comment (thanks matt)
s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
s.encode!('utf-8', 'utf-16')
end
However, I now want to build my rspec to verify that the code works. I don't have access to the file that caused the problem so I want to create a string with the bad encoding programmatically.
I've tried variations on things like:
bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.length.should > safe_str(bad_str).length
or,
bad_str = (100..1000).to_a.pack('c*')
bad_str.length.should > safe_str(bad_str).length
but the length is always the same. I have also tried different character ranges; not always 100 to 1000.
Any suggestions on how to build a string with an invalid encoding within a ruby 1.9.3 script?

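Lots of one-byte strings will make an invalid UTF-8 string, starting with 0x80. So 128.chr should work, once it is tagged as UTF-8 with force_encoding('utf-8'): Integer#chr returns a binary (ASCII-8BIT) string for values above 127, and a binary string always reports valid_encoding? as true.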

Your safe_str method will (currently) never actually do anything to the string, it is a no-op. The docs for String#encode on Ruby 1.9.3 say:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
This is true for the current release of 2.0.0 (patch level 247), however a recent commit to Ruby trunk changes this, and also introduces a scrub method that pretty much does what you want.
Until a new version of Ruby is released you will need to round trip your text string to another encoding and back to clean it, as in the second example in this answer to the question you linked to, something like:
def safe_str str
s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
s.encode!('utf-8', 'utf-16')
end
Note that your first example of an attempt to create an invalid string won’t work:
bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.valid_encoding? # => true
From the << docs:
If the object is a Integer, it is considered as a codepoint, and is converted to a character before concatenation.
So you’ll always get a valid string.
Your second method, using pack will create a string with the encoding ASCII-8BIT. If you then change this using force_encoding you can create a UTF-8 string with an invalid encoding:
bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
bad_str.valid_encoding? # => false
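Putting the pieces together, a plain-Ruby check of safe_str might look like the sketch below. The bytes in bad_str are an arbitrary choice for illustration, and the check uses simple comparisons rather than rspec so it can run on its own:

```ruby
# safe_str as given above: round-trip through UTF-16, dropping bad bytes.
def safe_str(str)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end

# "hi" followed by a stray 0xFF byte, tagged as UTF-8 so it is invalid.
bad_str = [0x68, 0x69, 0xFF].pack('C*').force_encoding('utf-8')

p bad_str.valid_encoding?                      # => false
p safe_str(bad_str).valid_encoding?            # => true
p bad_str.length > safe_str(bad_str).length    # => true
```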

Try with s = "hi \255"
s.valid_encoding?
# => false

The following example can be used for testing purposes:
describe TestClass do
let(:non_utf8_text) { "something\255 english." }
it 'does not raise an error on a string with an invalid byte sequence' do
expect(non_utf8_text).not_to be_valid_encoding
expect { subject.call(non_utf8_text) }.not_to raise_error
end
end
Thanks to Iwan B. for the "\255" advice.

In spec tests I’ve written, I haven’t found a way to fix this bad encoding:
Period%Basics
The %B string consistently produces ArgumentError: invalid byte sequence in UTF-8.

Ruby - Comparing "==" hex value to string

I am basically reading in the header of a picture file and doing a quick comparison to see what kind of file it actually is. BMP, GIF, PNG are all easy as their headers contain BM, GIF, and PNG respectively to identify themselves. JPG is throwing me for a bit of a loop tho.
The first 3 bytes of a jpg tend to be 0xff\0xd8\0xff and for the life of me I can't get a true value in a simple comparison no matter how I set it up.
I read in the first 4 bytes:
if data[0, 3] == "\xff\xd8\xff"
puts "This is a JPG"
end
I know I am close but I just can't get it to work. Please let me know what I'm missing out on here.
Note: I know there are gems to do this for me but I don't want to use a gem. Simple as that.

This is a character encoding issue. Reading the first 4 bytes from a JPEG returns a binary (ASCII-8BIT) encoded string:
head = File.read("some.jpg", 4)
# => "\xFF\xD8\xFF\xE1"
head.encoding
# => #<Encoding:ASCII-8BIT>
String literals, on the other hand, are UTF-8 encoded:
jpg_prefix = "\xff\xd8\xff"
# => "\xFF\xD8\xFF"
jpg_prefix.encoding
# => #<Encoding:UTF-8>
Comparing UTF-8 and ASCII-8BIT strings does not work as expected:
head[0,3] == jpg_prefix
# => false
You have to explicitly set the encoding with String#force_encoding:
jpg_prefix = "\xff\xd8\xff".force_encoding(Encoding::ASCII_8BIT)
# => "\xFF\xD8\xFF"
jpg_prefix.encoding
# => #<Encoding:ASCII-8BIT>
head[0,3] == jpg_prefix
# => true
Concatenating single-byte strings created with Integer#chr (as suggested by Mario Visic) also works:
jpg_prefix = 0xff.chr + 0xd8.chr + 0xff.chr
# => "\xFF\xD8\xFF"
jpg_prefix.encoding
# => #<Encoding:ASCII-8BIT>
Or by using Array#pack:
jpg_prefix = ["FFD8FF"].pack("H*")
# => "\xFF\xD8\xFF"
jpg_prefix.encoding
# => #<Encoding:ASCII-8BIT>
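The approaches above can be folded into one small check that normalizes both sides to binary before comparing. The helper name jpeg_header? is made up for this sketch, and String#b needs Ruby 2.0 or later:

```ruby
# The JPEG signature as a binary (ASCII-8BIT) string.
JPEG_MAGIC = "\xFF\xD8\xFF".b.freeze

# Hypothetical helper: true if the given bytes start with the JPEG signature.
# Taking a binary copy first means the caller's encoding no longer matters.
def jpeg_header?(data)
  data.b[0, 3] == JPEG_MAGIC
end

# In-memory examples; for a real file, File.binread(path, 4) would supply
# the header bytes.
p jpeg_header?("\xFF\xD8\xFF\xE1")   # => true
p jpeg_header?("GIF89a")             # => false
```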

Your code works fine for me when data is a string, but data is likely an array of byte values.
Try this:
if data[0,3] == [0xff, 0xd8, 0xff]
as your condition.

Identifying files is a good thing to let someone else do, if you can. The ruby-filemagic gem will do this.
gem 'ruby-filemagic'
In use, it returns a string:
require 'filemagic'
magic = FileMagic.new
p magic.file("/tmp/pic1.jpg")
# => "JPEG image data, JFIF standard 1.02"
The returned string can be matched against regular expressions:
case magic.file(path)
when /JPEG/
# do JPEG stuff
when /GIF/
# do GIF stuff
else
# we don't recognize it
end
ruby-filemagic uses the libmagic library, which recognizes a great number of file types.
The documentation is a little sparse (the README doesn't even have a "hello world" example), and it hasn't been updated in a few years, but don't let that deter you from trying it. It's simple enough to use, and pretty solid--I've got production code using it today, and it still works fine.
If, for some reason, you are unable to use the gem, but are in a *nix environment and have access to the "file" command, you can get the same functionality by shelling out to "file":
p `file /tmp/pic1.jpg`
# => "/tmp/pic1.jpg: JPEG image data, JFIF standard 1.02\n"
In Debian, the file command is provided by package file. Your OS may differ.

You should be able to compare the file information with the character codes, something like:
if data[0, 3] == 0xff.chr + 0xd8.chr + 0xff.chr
puts "This is a JPG"
end
If you get stuck you can always peek at the fastimage gem's code; the type detection code is here: https://github.com/sdsykes/fastimage/blob/master/lib/fastimage.rb#L337-L354
Like others (@Stefan) mentioned, the strings did not match in your original example because the encodings differed.
# Check the encodings for our strings:
"\xff\xd8\xff".encoding #=> #<Encoding:UTF-8>
(0xff.chr + 0xd8.chr + 0xff.chr).encoding #=> #<Encoding:ASCII-8BIT>
# Compare our two strings with different encodings:
utf8 = "\xff\xd8\xff"
ascii = 0xff.chr + 0xd8.chr + 0xff.chr
utf8 == ascii #=> false
utf8.force_encoding("ASCII-8BIT") == ascii #=> true
Your original code actually would have worked fine if you had forced the encoding to be ASCII-8BIT.

Equivalent of Iconv.conv("UTF-8//IGNORE",...) in Ruby 1.9.X?

I'm reading data from a remote source, and occasionally get some characters in another encoding. They're not important.
I'd like to get a "best guess" utf-8 string, and ignore the invalid data.
Main goal is to get a string I can use, and not run into errors such as:
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8:
invalid byte sequence in utf-8

I thought this was it:
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
will replace all unknowns with '?'.
To ignore all unknowns, :replace => '':
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
Edit:
I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:
string.encode("UTF-8", ...).force_encoding('UTF-8')
The script seems to be running OK now. But I'm pretty sure I'd gotten errors with this earlier.
Edit 2:
Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.

String#chars or String#each_char can be also used.
# Table 3-8. Use of U+FFFD in UTF-8 Conversion
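# http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
# The trailing + keeps the expression going; a line that starts with +
# would be parsed as a separate statement and str would be cut short.
str = "\x61"+"\xF1\x80\x80"+"\xE1\x80"+"\xC2" +
"\x62"+"\x80"+"\x63"+"\x80"+"\xBF"+"\x64"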
p [
'abcd' == str.chars.collect { |c| (c.valid_encoding?) ? c : '' }.join,
'abcd' == str.each_char.map { |c| (c.valid_encoding?) ? c : '' }.join
]
String#scrub can be used since Ruby 2.1.
p [
'abcd' == str.scrub(''),
'abcd' == str.scrub{ |c| '' }
]
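Both variants can be combined into a helper that works whether or not the running Ruby has String#scrub. This is only a sketch; the name drop_invalid_bytes is made up, and the sample bytes are the ones from Table 3-8 used above:

```ruby
# Hypothetical helper: drop invalid bytes, using String#scrub when the
# running Ruby provides it and a char-by-char filter otherwise.
def drop_invalid_bytes(str)
  if str.respond_to?(:scrub)
    str.scrub('')
  else
    str.chars.select(&:valid_encoding?).join
  end
end

bytes = [0x61, 0xF1, 0x80, 0x80, 0xE1, 0x80, 0xC2,
         0x62, 0x80, 0x63, 0x80, 0xBF, 0x64]
sample = bytes.pack('C*').force_encoding('UTF-8')

p drop_invalid_bytes(sample)   # => "abcd"
```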

This works great for me:
"String".encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "").force_encoding('UTF-8')

To ignore all unknown parts of the string that aren't correctly UTF-8 encoded the following (as you originally posted) almost does what you want.
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
The caveat is that encode doesn't do anything if it thinks the string is already UTF-8. So you need to change encodings, going via an encoding that can still encode the full set of unicode characters that UTF-8 can encode. (If you don't you'll corrupt any characters that aren't in that encoding - 7bit ASCII would be a really bad choice!) So go via UTF-16:
string.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')

With a bit of help from @masakielastic I have solved this problem for my personal purposes using the #chars method.
The trick is to break down each character into its own separate block so that ruby can fail.
Ruby needs to fail when it confronts binary code etc. If you don't allow ruby to go ahead and fail its a tough road when it comes to this stuff. So I use the String#chars method to break the given string into an array of characters. Then I pass that code into a sanitizing method that allows the code to have "microfailures" (my coinage) within the string.
So, given a "dirty" string, lets say you used File#read on a picture. (my case)
# Define the helper first so it exists when the block below calls it.
def num_or_letter?(char)
  if char =~ /[a-zA-Z0-9]/
    true
  elsif char =~ Regexp.union(" ", ".", "?", "-", "+", "/", ",", "(", ")")
    true
  end
end

dirty = File.open(filepath).read
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    # the regexp match raised on an invalid byte sequence; skip this char
    next
  end
end
clean = clean_chars.join("")
allowing the code to fail somewhere along in the process seems to be the best way to move through it. So long as you contain those failures within blocks you can grab what is readable by the UTF-8-only-accepting parts of ruby

I have not had luck with the one-line uses of String#encode, à la string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?"). They do not work reliably for me.
But I wrote a pure ruby "backfill" of String#scrub to MRI 1.9 or 2.0 or any other ruby that does not offer a String#scrub.
https://github.com/jrochkind/scrub_rb
It makes String#scrub available in rubies that don't have it; if loaded in MRI 2.1, it will do nothing and you'll still be using the built-in String#scrub, so it can allow you to easily write code that will work on any of these platforms.
Its implementation is somewhat similar to some of the other char-by-char solutions proposed in other answers, but it does not use exceptions for flow control (don't do that), is tested, and provides an API compatible with MRI 2.1 String#scrub.
