How to replace multibyte characters in Ruby using gsub?

I have a problem with saving records in MongoDB using Mongoid when they contain multibyte characters. This is the string:
a="Chris \xA5\xEB\xAE\xDFe\xA5"
I first convert it to BINARY and I then gsub it like this:
a.force_encoding("BINARY").gsub(0xA5.chr,"oo")
...which works fine:
=> "Chris oo\xEB\xAE\xDFeoo"
But it seems that I can not use the chr method if I use Regexp:
a.force_encoding("BINARY").gsub(/0x....?/.chr,"")
NoMethodError: undefined method `chr' for /0x....?/:Regexp
Anybody with the same issue?
Thanks a lot...

You can do that with interpolation
a.force_encoding("BINARY").gsub(/#{0xA5.chr}/,"")
gives
"Chris \xEB\xAE\xDFe"
EDIT: based on the comments, here is a version that converts the binary-encoded string to its escaped ASCII representation and runs a regex on that string:
a.unpack('A*').to_s.gsub(/\\x[A-F0-9]{2}/,"")[2..-3] #=> "Chris e"
The [2..-3] at the end gets rid of the leading [" and the trailing "]. Note that the plain ASCII "e" between the escaped bytes survives.
NOTE: to strip everything that is not a word character (this also removes the space), once the string has been forced to BINARY you could just use
a.gsub(/\W/,"") #=> "Chrise"
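For the original goal of matching any high byte with one pattern, another option is a byte-range character class combined with the n (binary) regexp flag. This is a minimal sketch, assuming Ruby 2.0+ for String#b; the variable names are only illustrative:

```ruby
# Sample string from the question; the \x escapes are raw bytes, not valid UTF-8.
a = "Chris \xA5\xEB\xAE\xDFe\xA5"

# String#b returns a copy tagged as ASCII-8BIT (BINARY) and leaves `a` untouched.
bytes = a.b

# The /n flag makes the regexp binary, so [\x80-\xFF] matches any single non-ASCII byte.
stripped = bytes.gsub(/[\x80-\xFF]/n, "")
replaced = bytes.gsub(/[\x80-\xFF]/n, "oo")

p stripped  # "Chris e"
p replaced  # "Chris ooooooooeoo"
```

Because the match is done byte by byte, each of the four consecutive high bytes is replaced separately.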

The actual string does not contain the literal characters \xA5: that is just how characters that would otherwise be unprintable are shown to you (similarly, when a string contains a newline, Ruby shows you \n).
If you want to replace anything that is not ASCII, you could do this:
a="Chris \xA5\xEB\xAE\xDFe\xA5"
a.force_encoding('BINARY').encode('ASCII', :invalid => :replace, :undef => :replace, :replace => 'oo')
This starts by forcing the string to the binary encoding. You always want to start with a string whose bytes are valid for its encoding, and binary is always valid because it can contain arbitrary bytes. Then it converts the string to ASCII. Normally this would raise an error, since there are bytes that have no ASCII equivalent, but the extra options tell encode to replace invalid or undefined sequences with the string 'oo'.
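A short sketch of the replacement approach described above, using "?" as an arbitrary replacement string. String#scrub (Ruby 2.1+) is included as an alternative that keeps the string in UTF-8; how many replacement characters it emits depends on how the invalid bytes group together, so no output is shown for it.

```ruby
a = "Chris \xA5\xEB\xAE\xDFe\xA5"

# Tag a copy as BINARY (always valid), then transcode to US-ASCII,
# replacing every byte that has no ASCII equivalent.
ascii = a.b.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
p ascii                  # "Chris ????e?"
p ascii.valid_encoding?  # true

# Alternative: stay in UTF-8 and replace only the invalid byte sequences.
scrubbed = a.scrub("?")
p scrubbed.valid_encoding?  # true
```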

Related

Ruby convert non-printable characters into numbers

I have a string with non-printable characters.
What I am currently doing is replacing them with a tilde using:
string.gsub!(/[^[:print:]]/, "~")
However, I would actually like to convert them to their integer value.
I tried this, but it always outputs 0
string.gsub!(/[^[:print:]]/, "#{$1.to_i}")
Thoughts?
String#gsub and String#gsub! accept an optional block. The return value of the block is used as the substitution.
"\x01Hello\x02".gsub(/[^[:print:]]/) { |x| x.ord }
# => "1Hello2"
Object#inspect is also an option if you just need to output a string with non-printable characters to a log or for debugging purposes.
puts "\x01Hello\x02".inspect
# => "\u0001Hello\u0002"
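The attempt with "#{$1.to_i}" most likely always yields 0 because the replacement string is interpolated once, before gsub! runs, when $1 is still nil (and the pattern has no capture group anyway). The block form is evaluated for every match. A small sketch follows; the <...> wrapper is just an illustrative choice to keep the numbers distinguishable from ordinary digits in the text:

```ruby
text = "\x01Hello\x02 line\n"

# The block receives each matched character; ord gives its integer code point.
visible = text.gsub(/[^[:print:]]/) { |char| "<#{char.ord}>" }

p visible  # "<1>Hello<2> line<10>"
```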

can't convert "[" with encoding "gb2312" to "utf-8" in ruby1.9.3

I'm learning Ruby and trying to get a filename from an FTP server. The string I get is encoded in GB2312 (Simplified Chinese). In most cases this code works:
str = str.force_encoding("gb2312")
str = str.encode("utf-8")
but it raises the error "in encode': "\xFD" followed by "\x88" on GB2312 (Encoding::InvalidByteSequenceError)" if the string contains the symbol "[" or "【".
Ruby's Encoding support allows a lot of introspection, so you can find out fairly well how to handle a given String:
"【".encoding
=> #<Encoding:UTF-8>
"【".valid_encoding?
=> true
"【".force_encoding("gb2312").valid_encoding?
=> false
That shows that the bytes of this character are not a valid sequence in the given character set. If you need to convert all such characters, you can use the encode method and provide defaults or replace undefined characters like so:
"【".encode("gb2312", invalid: :replace, undef: :replace)
=> "\x{A1BE}"
If you have a String that has mixed character Encodings, you are pretty screwed. There is no way to find out without a lot of guessing.
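For the decoding direction asked about in the question, the same replace options can be applied when going from GB2312 to UTF-8. This is a minimal sketch: to_utf8 is a hypothetical helper, the sample bytes are made up for illustration, and the "?" replacement is an arbitrary choice.

```ruby
# Hypothetical helper: decode raw bytes assumed to be in `source_encoding`
# into UTF-8, substituting "?" instead of raising on bad sequences.
def to_utf8(raw, source_encoding = "GB2312")
  raw.dup
     .force_encoding(source_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

valid_bytes = "\xD6\xD0\xCE\xC4".b   # GB2312 bytes for U+4E2D U+6587
decoded = to_utf8(valid_bytes)
p decoded == "\u4E2D\u6587"          # true

broken_bytes = "\xFD\x88".b          # the byte pair from the error message
recovered = to_utf8(broken_bytes)
p recovered.valid_encoding?          # true, and no exception is raised
```

If the server actually sends GBK (a superset of GB2312), passing "GBK" as the source encoding may decode such names without loss. That is an assumption about the server, not something the question confirms.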

How can I convert a string of codepoints to the string it represents?

I have a string (in Ruby) like this:
626c6168
(that is 'blah' without the quotes)
How do I convert it to 'blah'? Note that these are variable lengths, and also they aren't always letters and numbers. (They're being stored in a database, not being printed.)
Array#pack
['626c6168'].pack('H*')
# => "blah"
Using hex to convert each pair of hex digits:
"626c6168".scan(/../).map{ |c| c.hex.chr }.join
This gives blah.
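A small round-trip sketch of the pack approach, with the reverse direction added for storing the value again. The variable names are illustrative only:

```ruby
hex = "626c6168"

# 'H*' reads the whole string as hex digits, high nibble first.
decoded = [hex].pack("H*")
p decoded  # "blah"

# The reverse direction: bytes back to a hex string.
encoded = decoded.unpack("H*").first
p encoded  # "626c6168"

# pack returns a BINARY (ASCII-8BIT) string; retag it only if the bytes are known to be UTF-8.
as_utf8 = decoded.dup.force_encoding("UTF-8")
p as_utf8.valid_encoding?  # true
```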

Escape problem with hex

I need to print escaped characters to a binary file using Ruby. The main problem is that slashes need the whole byte to escape correctly, and I don't know/can't create the byte in such a way.
I am creating the hex value with, basically:
'\x' + char
Where char is some 'hex' value, such as 65. In hex, \x65 is the ASCII character 'e'.
Unfortunately, when I puts this sequence to the file, I end up with this:
\\x65
How do I create a hex string with the properly escaped value? I have tried a lot of things, involving single or double quotes, pack, unpack, multiple slashes, etc. I have tried so many different combinations that I feel as though I understand the problem less now than I did when I started.
How?
You may need to set binary mode on your file, and/or use putc.
File.open("foo.tmp", "w") do |f|
  f.set_encoding(Encoding::BINARY) # set_encoding requires Ruby 1.9+
  f.binmode # only matters on Windows
  f.putc "65".hex # "65".hex is 0x65 (101), the byte for "e"; "e".hex would be 14
end
Hopefully this can give you some ideas even if you have Ruby <1.9.
Okay, if you want to create a string whose first byte has the integer value 0x65, use Array#pack:
irb> [0x65].pack('U')
#=> "e"
irb> "e"[0]
#=> 101
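101 in decimal is 65 in hexadecimal, so this works. (The "e"[0] check is Ruby 1.8 behavior; on Ruby 1.9+ it returns "e", and "e".ord gives 101.)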
If you want to create a literal string whose first byte is '\', second is 'x', third is '6', and fourth is '5', then just use interpolation:
irb> "\\x#{65}"
#=> "\\x65"
irb> "\\x65".split('')
#=> ["\\", "x", "6", "5"]
If you have the hex value and you want to create a string containing the character corresponding to that hex value, you can do:
irb(main):002:0> '65'.hex.chr
=> "e"
Another option is to use Array#pack; this can be used if you need to convert a list of numbers to a single string:
irb(main):003:0> ['65'.hex].pack("C")
=> "e"
irb(main):004:0> ['66', '6f', '6f'].map {|x| x.hex}.pack("C*")
=> "foo"
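Putting the pieces together for the original question (hex values in, raw bytes in a file out), a minimal sketch might look like this. hex_to_bytes and the file name are hypothetical, and a temporary directory is used only so the example cleans up after itself:

```ruby
require "tmpdir"

# Hypothetical helper: turn hex strings such as "65" into raw bytes.
def hex_to_bytes(hex_values)
  hex_values.map(&:hex).pack("C*")
end

bytes = hex_to_bytes(%w[66 6f 6f])

round_trip = Dir.mktmpdir do |dir|
  path = File.join(dir, "foo.bin") # illustrative file name
  # "wb" opens the file in binary mode, so no newline or encoding conversion occurs.
  File.open(path, "wb") { |f| f.write(bytes) }
  File.binread(path)
end

p round_trip  # "foo"
```

Writing with write rather than puts also avoids the trailing newline that puts appends.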

How to extract a single character (as a string) from a larger string in Ruby?

What is the Ruby idiomatic way for retrieving a single character from a string as a one-character string? There is the str[n] method of course, but (as of Ruby 1.8) it returns a character code as a fixnum, not a string. How do you get to a single-character string?
In Ruby 1.9, it's easy: Strings are encoding-aware sequences of characters, so you can just index into one and get a single-character string back:
'µsec'[0] # => 'µ'
However, in Ruby 1.8, Strings are sequences of bytes and thus completely unaware of the encoding. If you index into a string and that string uses a multibyte encoding, you risk indexing right into the middle of a multibyte character (in this example, the 'µ' is encoded in UTF-8):
'µsec'[0] # => 194
'µsec'[0].chr # => Garbage
'µsec'[0,1] # => Garbage
However, Regexps and some specialized string methods support at least a small subset of popular encodings, among them some Japanese encodings (e.g. Shift-JIS) and (in this example) UTF-8:
'µsec'.split('')[0] # => 'µ'
'µsec'.split(//u)[0] # => 'µ'
Before Ruby 1.9:
'Hello'[1].chr # => "e"
Ruby 1.9+:
'Hello'[1] # => "e"
A lot has changed in Ruby 1.9 including string semantics.
Should work for Ruby before and after 1.9:
'Hello'[2,1] # => "l"
Please see Jörg Mittag's comment: this is correct only for single-byte character sets.
'abc'[1..1] # => "b"
'abc'[1].chr # => "b"
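On Ruby 1.9 and later, the difference between character and byte access can be checked directly. A brief sketch, assuming a UTF-8 string; the \u escape is used only so the example does not depend on the source file's encoding:

```ruby
word = "\u00B5sec" # "µsec"

first_char = word[0]                 # character-based indexing
first_byte = word.bytes.first        # byte-based access
half_char  = word.byteslice(0, 1)    # first byte only, as a string

p first_char                 # "µ"
p word.chars.first           # "µ"
p first_byte                 # 194
p half_char.valid_encoding?  # false: half of a two-byte UTF-8 sequence
p word[1, 1]                 # "s"
```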
