Ruby encode UTF-8 string to UTF-16

I want to store the UTF-16 encoded form of a string in another variable, as a UTF-8 string.
1.9.3p194 :117 > str = "سلام"
=> "سلام"
1.9.3p194 :118 > enc = str.encode("utf-16")
=> "\uFEFF\u0633\u0644\u0627\u0645"
1.9.3p194 :119 > puts enc
??3D'E
=> nil
I want to store \uFEFF\u0633\u0644\u0627\u0645 (not ??3D'E) in a UTF-8 string so that I can concatenate it with other UTF-8 strings.

Use String#inspect:
str = "سلام"
# => "سلام"
enc = str.encode("utf-16")
# => "\uFEFF\u0633\u0644\u0627\u0645"
puts enc
# output: ▒▒3D'E
# => nil
puts enc.inspect
# output: "\uFEFF\u0633\u0644\u0627\u0645"
# => nil
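If the goal is a plain UTF-8 string that holds the escape text itself, another option is to build it from the code points of the original string instead of relying on inspect output. This is only a sketch: unicode_escape is a hypothetical helper (not part of Ruby), and it assumes every code point fits in four hex digits.

```ruby
# Hypothetical helper (not from the original answer): renders every
# code point as a literal \uXXXX escape. Assumes code points <= 0xFFFF.
def unicode_escape(text)
  # Single quotes keep the backslash literal, so '\u' is two characters.
  text.codepoints.map { |cp| format('\u%04X', cp) }.join
end

str = "\u0633\u0644\u0627\u0645"   # the string from the question, written with escapes
escaped = unicode_escape(str)

puts escaped                       # prints \u0633\u0644\u0627\u0645
puts "escaped: " + escaped         # concatenates like any other string
```

The byte order mark that encode("utf-16") prepends is not part of the original string, so it does not appear here; prepend the literal text '\uFEFF' if it is needed.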

Related

Why a dangerous method doesn't work with a character element of String in Ruby?

When I apply the upcase! method I get:
a="hello"
a.upcase!
a # Shows "HELLO"
But in this other case:
b="hello"
b[0].upcase!
b[0] # Shows h
b # Shows hello
I don't understand why the upcase! applied to b[0] doesn't have any effect.
b[0] returns a new String every time. Check out the object id:
b = 'hello'
# => "hello"
b[0].object_id
# => 1640520
b[0].object_id
# => 25290780
b[0].object_id
# => 24940620
When you select an individual character in a string, you are not referencing that character directly; you are calling an accessor/mutator method that performs the evaluation:
2.0.0-p643 :001 > hello = "ruby"
=> "ruby"
2.0.0-p643 :002 > hello[0] = "R"
=> "R"
2.0.0-p643 :003 > hello
=> "Ruby"
When you call a dangerous (bang) method on that result, the accessor first returns a new one-character string, and the method then modifies that new string in place. Because the new string has no connection to the original, the original string is left unchanged:
2.0.0-p643 :004 > hello = "ruby"
=> "ruby"
2.0.0-p643 :005 > hello[0].upcase!
=> "R"
2.0.0-p643 :006 > hello
=> "ruby"
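To change the character inside the original string, the result has to be written back through String#[]=. A minimal sketch of both behaviours (the variable name copy is illustrative):

```ruby
b = "hello"

# Each call to b[0] allocates a fresh one-character String.
copy = b[0]
copy.upcase!            # changes only the temporary copy
puts b                  # prints hello

# Write the upcased character back through String#[]= to change b itself.
b[0] = b[0].upcase
puts b                  # prints Hello
```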

Regular expression for only 2 letters

I need to create a regular expression that matches 2 and only 2 letters. I thought it should be /[a-z]{2}/i, but that matches any string containing 2 or more letters. Here is what I get:
my_reg_exp = /[a-z]{2}/i
my_reg_exp.match('aa') # => #<MatchData "aa">
my_reg_exp.match('AA') # => #<MatchData "AA">
my_reg_exp.match('a') # => nil
my_reg_exp.match('aaa') # => #<MatchData "aa">
Any suggestion?
You can add the anchors like this:
my_reg_exp = /^[a-z]{2}$/i
Test:
my_reg_exp.match('aaa')
#=> nil
my_reg_exp.match('aa')
#=> #<MatchData "aa">
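One caveat worth checking for the anchored pattern: in Ruby, ^ and $ anchor to line boundaries, while \A and \z anchor to the whole string. The sketch below (variable names are illustrative) shows the difference on a multi-line input:

```ruby
line_anchored   = /^[a-z]{2}$/i     # ^ and $ match at every line boundary
string_anchored = /\A[a-z]{2}\z/i   # \A and \z match only at the string ends

multiline = "aa\nbbb"

p line_anchored.match(multiline)    # matches "aa" on the first line
p string_anchored.match(multiline)  # prints nil
p string_anchored.match("aa")       # matches "aa"
```

If the input can never contain a newline, both forms behave the same; for validating untrusted input, the \A and \z form is the stricter choice.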
Hao's solution isn't locale sensitive: [a-z] only matches ASCII letters. If this is important for your use case, use:
/\A[[:alpha:]]{2}\z/
2.0.0-p451 :005 > 'aba' =~ /\A[[:alpha:]]{2}\Z/
=> nil
2.0.0-p451 :006 > 'ab' =~ /\A[[:alpha:]]{2}\Z/
=> 0
2.0.0-p451 :007 > 'xy' =~ /\A[[:alpha:]]{2}\Z/
=> 0
2.0.0-p451 :008 > 'zxy' =~ /\A[[:alpha:]]{2}\Z/
=> nil
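The following sketch illustrates two details of these patterns, assuming the input is a UTF-8 string: [[:alpha:]] accepts non-ASCII letters that [a-z] rejects, and \Z tolerates one trailing newline while \z does not.

```ruby
ascii_only = /\A[a-z]{2}\z/i
any_letter = /\A[[:alpha:]]{2}\z/

word = "\u00E9t"   # two letters, the first one is U+00E9 (e with acute accent)

p ascii_only.match(word)                  # prints nil
p any_letter.match(word)                  # matches both letters
p any_letter.match("a1")                  # prints nil, a digit is not a letter

p "ab\n" =~ /\A[[:alpha:]]{2}\Z/          # prints 0, \Z allows a final newline
p "ab\n" =~ /\A[[:alpha:]]{2}\z/          # prints nil
```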
Per usual, if you need further assistance, leave a comment.
You can use /\b[a-z]{2}\b/i to match a two-letter word. \b matches a word boundary.
This means you can scan a string to find all occurrences:
'Foo is a bar'.scan(/\b[a-z]{2}\b/i) #=> ["is"]
Or find the first match in a string using:
'a bc def'[/\b[a-z]{2}\b/i] # => "bc"

Comparing bytes in Ruby

I have a binary blob header of either a JPG or MP4 file. I am trying to differentiate between the two.
When the file is a JPG, the first two bytes are \xFF\xD8. However, when I make the comparison blob[0] == "\xFF", it fails, even though I know that blob[0] is in fact \xFF.
What is the best way to do this?
This is an encoding issue. You are comparing a string with binary encoding (your JPEG blob) with a UTF-8 encoded string ("\xFF"):
foo = "\xFF".force_encoding("BINARY") # like your blob
bar = "\xFF"
p foo # => "\xFF"
p bar # => "\xFF"
p foo == bar # => false
There are several ways to create a binary encoded string:
str = "\xFF\xD8".b # => "\xFF\xD8" (Ruby 2.x)
str.encoding # => #<Encoding:ASCII-8BIT>
str = "\xFF\xD8".force_encoding("BINARY") # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
str = 0xFF.chr + 0xD8.chr # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
str = ["FFD8"].pack("H*") # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
All of the above can be compared with your blob.
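Putting this together for the original task, here is a minimal sketch of a header check. file_kind is a hypothetical helper, and the MP4 branch rests on an assumption that is not stated in the question: that the blob follows the common MP4 layout with the ASCII marker "ftyp" at byte offset 4.

```ruby
# Hypothetical helper: classifies a header blob by its leading bytes.
def file_kind(blob)
  header = blob.b                       # binary-tagged copy, bytes unchanged
  jpeg_magic = "\xFF\xD8".b             # JPEG files start with these two bytes

  return :jpg if header.start_with?(jpeg_magic)
  # Assumption: MP4 headers carry "ftyp" at byte offset 4.
  return :mp4 if header.byteslice(4, 4) == "ftyp"
  :unknown
end

p file_kind("\xFF\xD8\xFF\xE0".b)            # prints :jpg
p file_kind("\x00\x00\x00\x18ftypmp42".b)    # prints :mp4
p file_kind("plain text")                    # prints :unknown
```

For a single byte, comparing integers avoids the encoding question entirely: blob.getbyte(0) == 0xFF.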

Ruby String.encode still gives "invalid byte sequence in UTF-8"

In IRB, I'm trying the following:
1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
=> "\xBF"
1.9.3p194 :002 > foo.match /foo/
ArgumentError: invalid byte sequence in UTF-8
from (irb):2:in `match'
Any ideas what's going wrong?
I'd guess that "\xBF" already thinks it is encoded in UTF-8 so when you call encode, it thinks you're trying to encode a UTF-8 string in UTF-8 and does nothing:
>> s = "\xBF"
=> "\xBF"
>> s.encoding
=> #<Encoding:UTF-8>
\xBF isn't valid UTF-8 so this is, of course, nonsense. But if you use the three argument form of encode:
encode(dst_encoding, src_encoding [, options] ) → str
[...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.
You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:
>> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "�"
Where s is the "\xBF" that thinks it is UTF-8 from above.
You could also use force_encoding on s to force it to be binary and then use the two-argument encode:
>> s.encoding
=> #<Encoding:UTF-8>
>> s.force_encoding('binary')
=> "\xBF"
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
=> "�"
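The replacement character can also be chosen explicitly with the :replace option of encode. A small sketch of the same approach, working on a copy so that the original string keeps its encoding tag (the variable names raw and cleaned are illustrative):

```ruby
s = "\xBF".force_encoding("UTF-8")       # tagged UTF-8, but not valid UTF-8
raw = s.dup.force_encoding("BINARY")     # retag a copy, leaving s untouched

# Transcode from binary to UTF-8, substituting "?" for bytes with no mapping.
cleaned = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

p s.valid_encoding?        # prints false
p cleaned                  # prints "?"
p cleaned.valid_encoding?  # prints true
```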
If you're only working with ASCII characters, you can use:
>> "Hello \xBF World!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "Hello � World!"
But what happens if we use the same approach with valid UTF-8 characters that are not ASCII?
>> "¡Hace \xBF mucho frío!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "��Hace � mucho fr��o!"
Uh oh! We want frío to keep its accent. Here's an option that keeps the valid UTF-8 characters:
>> "¡Hace \xBF mucho frío!".chars.select{|i| i.valid_encoding?}.join
=> "¡Hace  mucho frío!"
Ruby 2.1 also added a method called scrub that solves this problem:
>> "¡Hace \xBF mucho frío!".scrub
=> "¡Hace � mucho frío!"
>> "¡Hace \xBF mucho frío!".scrub('')
=> "¡Hace  mucho frío!"
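String#scrub also accepts a block, which receives each invalid byte sequence and returns its replacement. As a sketch, this makes the dropped bytes visible as hex instead of silently removing them (the angle-bracket format is just an illustration):

```ruby
# Build the test string from parts so the invalid byte is explicit.
dirty = "\u00A1Hace " + "\xBF".force_encoding("UTF-8") + " mucho fr\u00EDo!"

# The block is called once per invalid byte sequence.
marked = dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

p dirty.valid_encoding?    # prints false
p marked.valid_encoding?   # prints true
puts marked                # the invalid byte now appears as <bf>
```

Note that scrubbing with an empty string leaves both of the spaces that surrounded the invalid byte in place.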
This is fixed if you read the source text file using an explicit encoding:
File.open( 'thefile.txt', 'r:iso8859-1' )
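With only an external encoding in the mode string, the data read is tagged as ISO-8859-1. Adding a second encoding name asks Ruby to transcode while reading. A self-contained sketch, assuming the file really is ISO-8859-1 (read_latin1_as_utf8 is a hypothetical helper, and the temporary file stands in for thefile.txt):

```ruby
require "tempfile"

# Hypothetical helper: reads an ISO-8859-1 file and returns UTF-8 text.
def read_latin1_as_utf8(path)
  # "r:external:internal" transcodes from ISO-8859-1 to UTF-8 on read.
  File.open(path, "r:iso-8859-1:utf-8", &:read)
end

# Write two non-ASCII ISO-8859-1 bytes (0xED and 0xBF) to a temporary file.
file = Tempfile.new("latin1")
file.binmode
file.write("fr\xEDo \xBF".b)
file.close

text = read_latin1_as_utf8(file.path)
file.unlink

p text.encoding          # prints #<Encoding:UTF-8>
p text.valid_encoding?   # prints true
```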

Why are two strings with same bytes and encoding not identical in Ruby 1.9?

In Ruby 1.9.2, I found a way to make two strings that have the same bytes, same encoding, and are equal, but they have a different length and different characters returned by [].
Is this a bug? If it is not a bug, then I'd like to fully understand it. What kind of information is stored inside Ruby 1.9.2 String objects that allows these two strings to behave differently?
Below is the code that reproduces this behavior. The comments that start with #=> show you what output I am getting from this script, and the parenthetical words tell you my judgment of that output.
#!/usr/bin/ruby1.9
# coding: utf-8
string1 = "\xC2\xA2" # A well-behaved string with one character (¢)
string2 = "".concat(0xA2) # A bizarre string very similar to string1.
p string1.bytes.to_a #=> [194, 162] (good)
p string2.bytes.to_a #=> [194, 162] (good)
puts string1.encoding.name #=> UTF-8 (good)
puts string2.encoding.name #=> UTF-8 (good)
puts string1 == string2 #=> true (good)
puts string1.length #=> 1 (good)
puts string2.length #=> 2 (weird!)
p string1[0] #=> "¢" (good)
p string2[0] #=> "\xC2" (weird!)
I am running Ubuntu and compiled Ruby from source. My Ruby version is:
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
This was a bug in Ruby, and it was fixed in r29848.
Matz mentioned this question via Twitter:
http://twitter.com/matz_translator/status/6597021662187520
http://twitter.com/matz_translator/status/6597055132733440
"It's hard to determine as a bug but, it's not acceptable to leave it as is. I'd prefer to fix this issue."
I think the problem is in the string's encoding. Check out James Gray's Shades of Gray: Ruby 1.9's String article on Unicode encoding.
Additional odd behavior:
# coding: utf-8
string1 = "\xC2\xA2"
string2 = "".concat(0xA2)
string3 = 0xC2.chr + 0xA2.chr
string1.bytes.to_a # => [194, 162]
string2.bytes.to_a # => [194, 162]
string3.bytes.to_a # => [194, 162]
string1.encoding.name # => "UTF-8"
string2.encoding.name # => "UTF-8"
string3.encoding.name # => "ASCII-8BIT"
string1 == string2 # => true
string1 == string3 # => false
string2 == string3 # => true
string1.length # => 1
string2.length # => 2
string3.length # => 2
string1[0] # => "¢"
string2[0] # => "\xC2"
string3[0] # => "\xC2"
string3.unpack('C*') # => [194, 162]
string4 = string3.unpack('C*').pack('C*') # => "\xC2\xA2"
string4.encoding.name # => "ASCII-8BIT"
string4.force_encoding('UTF-8') # => "¢"
string3.force_encoding('UTF-8') # => "¢"
string3.encoding.name # => "UTF-8"
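Leaving the 1.9.2 bug aside, the legitimate part of this behaviour is that the same bytes give different lengths and characters depending on the encoding tag. A minimal sketch, built from raw bytes so that nothing depends on the source file encoding:

```ruby
bytes = [0xC2, 0xA2]

utf8   = bytes.pack("C*").force_encoding("UTF-8")  # one character, U+00A2
binary = bytes.pack("C*")                          # pack returns ASCII-8BIT

p utf8.bytes == binary.bytes    # prints true
p utf8.length                   # prints 1
p binary.length                 # prints 2
p utf8 == binary                # prints false, the encodings are not comparable

# Retagging a copy of the binary string makes the two strings equal.
p binary.dup.force_encoding("UTF-8") == utf8   # prints true
```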
