Ruby String.encode still gives "invalid byte sequence in UTF-8" - ruby

In IRB, I'm trying the following:
1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
=> "\xBF"
1.9.3p194 :002 > foo.match /foo/
ArgumentError: invalid byte sequence in UTF-8
from (irb):2:in `match'
Any ideas what's going wrong?

I'd guess that "\xBF" already thinks it is encoded in UTF-8 so when you call encode, it thinks you're trying to encode a UTF-8 string in UTF-8 and does nothing:
>> s = "\xBF"
=> "\xBF"
>> s.encoding
=> #<Encoding:UTF-8>
\xBF isn't valid UTF-8 so this is, of course, nonsense. But if you use the three argument form of encode:
encode(dst_encoding, src_encoding [, options] ) → str
[...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.
You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:
>> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "�"
Where s is the "\xBF" that thinks it is UTF-8 from above.
You could also use force_encoding on s to force it to be binary and then use the two-argument encode:
>> s.encoding
=> #<Encoding:UTF-8>
>> s.force_encoding('binary')
=> "\xBF"
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
=> "�"
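To make the failure mode concrete, here is a minimal sketch that checks valid_encoding? first and only falls back to the binary transcoding shown above when the string is actually broken. The helper name to_valid_utf8 is made up for illustration, and it assumes the input is tagged as UTF-8.

```ruby
# Hypothetical helper (not part of Ruby): returns a string that is safe to
# match against. Assumes str is tagged as UTF-8, as in the question.
def to_valid_utf8(str)
  return str if str.valid_encoding?
  # Treat the bytes as binary so encode cannot skip the conversion.
  str.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
end

s = "\xBF"
puts s.valid_encoding?          # false
foo = to_valid_utf8(s)
puts foo.valid_encoding?        # true
puts foo.match(/foo/).inspect   # nil, and no ArgumentError
```

Note that the fallback replaces every non-ASCII byte, so it only suits input that is supposed to be plain ASCII.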

If you're only working with ASCII characters, you can use:
>> "Hello \xBF World!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "Hello � World!"
But what happens if we use the same approach on a string that also contains valid UTF-8 characters outside ASCII?
>> "¡Hace \xBF mucho frío!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "��Hace � mucho fr��o!"
Uh oh! Treating the string as binary replaces every byte of each multibyte character, but we want frío to keep its accent. Here's an option that keeps the valid UTF-8 characters:
>> "¡Hace \xBF mucho frío!".chars.select{|i| i.valid_encoding?}.join
=> "¡Hace  mucho frío!"
Ruby 2.1 also added a method called scrub that solves this problem:
>> "¡Hace \xBF mucho frío!".scrub
=> "¡Hace � mucho frío!"
>> "¡Hace \xBF mucho frío!".scrub('')
=> "¡Hace  mucho frío!"
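If the same code has to run on Rubies both older and newer than 2.1, the two approaches can be combined. This is only a sketch, and drop_invalid_bytes is a hypothetical name:

```ruby
# Hypothetical helper: strips invalid byte sequences while leaving valid
# multibyte characters untouched.
def drop_invalid_bytes(str)
  if str.respond_to?(:scrub)
    str.scrub('')                                    # Ruby 2.1 and later
  else
    str.chars.select { |c| c.valid_encoding? }.join  # older Rubies
  end
end

cleaned = drop_invalid_bytes("¡Hace \xBF mucho frío!")
puts cleaned.valid_encoding?   # true
```

Both branches remove the bad byte outright, so the spaces on either side of it end up next to each other.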

This is fixed if you read the source text file with an explicit encoding:
File.open( 'thefile.txt', 'r:iso8859-1' )
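As a rough illustration, the mode string can also name a second encoding ("external:internal"), in which case Ruby transcodes while reading. The sketch below uses a temp file as a stand-in for thefile.txt and assumes the file really is Latin-1; read_latin1_as_utf8 is a hypothetical helper.

```ruby
require 'tempfile'

# Hypothetical helper: reads a Latin-1 file and returns its text as UTF-8.
def read_latin1_as_utf8(path)
  File.open(path, 'r:iso-8859-1:utf-8') { |f| f.read }
end

# Stand-in for thefile.txt: a temp file holding "café" as Latin-1 bytes.
Tempfile.create('latin1') do |tmp|
  tmp.binmode
  tmp.write("caf\xE9".b)
  tmp.flush
  text = read_latin1_as_utf8(tmp.path)
  puts text.encoding          # UTF-8
  puts text.valid_encoding?   # true
end
```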

Related

Why a dangerous method doesn't work with a character element of String in Ruby?

When I apply the upcase! method I get:
a="hello"
a.upcase!
a # Shows "HELLO"
But in this other case:
b="hello"
b[0].upcase!
b[0] # Shows h
b # Shows hello
I don't understand why upcase! applied to b[0] doesn't have any effect.
b[0] returns a new String every time. Check out the object id:
b = 'hello'
# => "hello"
b[0].object_id
# => 1640520
b[0].object_id
# => 25290780
b[0].object_id
# => 24940620
When you select an individual character in a string, you're not referencing the character itself; you're calling an accessor/mutator method that performs the operation:
2.0.0-p643 :001 > hello = "ruby"
=> "ruby"
2.0.0-p643 :002 > hello[0] = "R"
=> "R"
2.0.0-p643 :003 > hello
=> "Ruby"
When you call a dangerous method on the result, the accessor first returns a new String holding that character, and the method then modifies that new String in place. Because the new String has no connection to the original string, the original is left unchanged.
2.0.0-p643 :004 > hello = "ruby"
=> "ruby"
2.0.0-p643 :005 > hello[0].upcase!
=> "R"
2.0.0-p643 :006 > hello
=> "ruby"
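To actually change the character, the result has to be written back through the mutator. A minimal sketch (upcase_first! is a made-up helper name):

```ruby
# Hypothetical helper: upcases the first character of str in place by
# assigning the new character back with String#[]=.
def upcase_first!(str)
  str[0] = str[0].upcase unless str.empty?
  str
end

b = "hello".dup      # dup so the example also works with frozen literals

first = b[0]         # a brand-new one-character String
first.upcase!        # mutates only that copy
puts first           # H
puts b               # hello

upcase_first!(b)
puts b               # Hello
```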

Ruby encode UTF-8 string to UTF-16

I want to store the UTF-16 encoding in another variable as a UTF-8 string.
1.9.3p194 :117 > str = "سلام"
=> "سلام"
1.9.3p194 :118 > enc = str.encode("utf-16")
=> "\uFEFF\u0633\u0644\u0627\u0645"
1.9.3p194 :119 > puts enc
??3D'E
=> nil
I want to store \uFEFF\u0633\u0644\u0627\u0645 (not ??3D'E) in a UTF-8 string so that I can concatenate it with other UTF-8 strings.
Use String#inspect:
str = "سلام"
# => "سلام"
enc = str.encode("utf-16")
# => "\uFEFF\u0633\u0644\u0627\u0645"
puts enc
# output: ▒▒3D'E
# => nil
puts enc.inspect
# output: "\uFEFF\u0633\u0644\u0627\u0645"
# => nil
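Since the exact text produced by inspect is a display format rather than a stable serialization, an alternative is to build the escapes from the code points directly. The sketch below is an assumption about what is wanted: it omits the \uFEFF byte order mark, only handles characters up to U+FFFF, and unicode_escapes is a hypothetical name.

```ruby
# Hypothetical helper: returns the \uXXXX escapes for each character of str
# as an ordinary string that can be concatenated with UTF-8 text.
# Characters above U+FFFF would need surrogate pairs, not handled here.
def unicode_escapes(str)
  str.encode('UTF-8').codepoints.map { |cp| format('\u%04X', cp) }.join
end

str = "سلام"
escaped = unicode_escapes(str)
puts escaped                       # \u0633\u0644\u0627\u0645
puts "escaped form: " + escaped    # concatenates without an encoding error
```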

Comparing bytes in Ruby

I have a binary blob header of either a JPG or MP4 file. I am trying to differentiate between the two.
When the file is a JPG, the first two bytes are \xFF\xD8. However, when I make the comparison blob[0] == "\xFF", it fails, even though I know that blob[0] is in fact \xFF.
What is the best way to do this?
This is an encoding issue. You are comparing a string with binary encoding (your JPEG blob) with a UTF-8 encoded string ("\xFF"):
foo = "\xFF".force_encoding("BINARY") # like your blob
bar = "\xFF"
p foo # => "\xFF"
p bar # => "\xFF"
p foo == bar # => false
There are several ways to create a binary encoded string:
str = "\xFF\xD8".b # => "\xFF\xD8" (Ruby 2.x)
str.encoding # => #<Encoding:ASCII-8BIT>
str = "\xFF\xD8".force_encoding("BINARY") # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
str = 0xFF.chr + 0xD8.chr # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
str = ["FFD8"].pack("H*") # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
All of the above can be compared with your blob.
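Putting this together, a header check might look like the sketch below. The JPEG bytes come from the question; the MP4 test (the ASCII marker "ftyp" at byte offset 4) is an assumption about the container layout that the question does not state, and sniff_type is a hypothetical name.

```ruby
JPEG_MAGIC = "\xFF\xD8".b   # binary-encoded, so it compares equal to blob bytes

# Hypothetical helper: guesses the file type from the first bytes of a blob.
def sniff_type(blob)
  header = blob.b                                 # binary copy, whatever the input encoding
  return :jpeg if header.start_with?(JPEG_MAGIC)
  return :mp4  if header[4, 4] == 'ftyp'.b        # assumed MP4 marker, see above
  :unknown
end

puts sniff_type("\xFF\xD8\xFF\xE0".b)   # jpeg
```

Comparing integers with blob.getbyte(0) == 0xFF also sidesteps the encoding question, since no strings are compared at all.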

ruby 1.8.7 why .to_yaml converts some Strings to non-readable bytes

Parsing some webpages with Nokogiri, I've run into issues while cleaning some Strings and saving them with YAML. This IRB session reproduces the problem:
irb(main):001:0> require 'yaml'
=> true
irb(main):002:0> "1,000 €".to_yaml
=> "--- !binary |\nMSwwMDAg4oKs\n\n"
irb(main):003:0> "1,0000 €".to_yaml
=> "--- \"1,0000 \\xE2\\x82\\xAC\"\n"
irb(main):004:0> "1,00 €".to_yaml
=> "--- !binary |\nMSwwMCDigqw=\n\n"
irb(main):005:0> "1 €".to_yaml
=> "--- !binary |\nMSDigqw=\n\n"
irb(main):006:0> "23 €".to_yaml
=> "--- !binary |\nMjMg4oKs\n\n"
irb(main):007:0> "12000 €".to_yaml
=> "--- !binary |\nMTIwMDAg4oKs\n\n"
irb(main):008:0> "1200000 €".to_yaml
=> "--- \"1200000 \\xE2\\x82\\xAC\"\n"
irb(main):009:0> "120000 €".to_yaml
=> "--- \"120000 \\xE2\\x82\\xAC\"\n"
irb(main):010:0> "12000 €".to_yaml
=> "--- !binary |\nMTIwMDAg4oKs\n\n"
To sum up, sometimes .to_yaml outputs are readable while other times the output is unreadable. The most intriguing aspect is that the strings are very similar.
How can I avoid those !binary ... outputs?
Whether YAML dumps a string as text or as binary depends on the ratio of non-ASCII to ASCII characters in it, which is why strings that differ by a single character can be dumped differently.
If you want to avoid !binary as much as possible, you should use the ya2yaml gem. It tries hard to dump strings as ASCII + escaped UTF-8.
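The ratio idea can be sketched as follows. This is not the library's code: the byte-based count and the 0.3 threshold are assumptions chosen because they reproduce the IRB output in the question, and both method names are hypothetical.

```ruby
# Hypothetical helper: fraction of bytes that are not printable ASCII
# (line breaks are not counted against the string).
def non_ascii_ratio(str)
  bytes = str.bytes
  return 0.0 if bytes.empty?
  odd = bytes.count { |b| (b < 0x20 || b > 0x7E) && b != 0x0A && b != 0x0D }
  odd.to_f / bytes.size
end

# Assumed rule: dump as !binary when the ratio exceeds the threshold.
def looks_binary?(str, threshold = 0.3)
  non_ascii_ratio(str) > threshold
end

puts looks_binary?("1,000 €")    # true  (3 of 9 bytes are non-ASCII)
puts looks_binary?("1,0000 €")   # false (3 of 10 bytes are non-ASCII)
```

Under this reading, the euro sign always contributes three bytes, so the outcome depends only on how many ASCII characters surround it.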

Transliteration with Iconv in Ruby

When I'm trying to transliterate a Cyrillic utf-8 string with
Iconv.iconv('ascii//ignore//translit', 'utf-8', string).to_s
(see questions/1726404/transliteration-in-ruby)
I'm getting everything but those symbols that have to be transliterated.
For example: 'r-строка' → 'r-' and 'Gévry' → 'Gvry'.
What's wrong?
Ruby 1.8.7 / Rails 2.3.5 / WSeven
require 'iconv'
p Iconv.iconv('ascii//translit//ignore', 'utf-8', 'Gévry') #=> ["Gevry"]
# not 'ascii//ignore//translit'
For Cyrillic the translit gem might work.
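Iconv is no longer part of the standard library from Ruby 2.0 on, so on newer Rubies the Latin case needs a different route. One possible sketch, assuming Ruby 2.2 or later for String#unicode_normalize (strip_accents is a hypothetical name, and this does nothing useful for Cyrillic):

```ruby
# Hypothetical helper: decompose accented letters (NFD), then drop the
# combining marks that carried the accents.
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts strip_accents("Gévry")   # Gevry
```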
That solution seems too tricky for me. I solved the problem with the stringex gem.
Another way is to write a custom transliteration with String's tr and gsub methods, without using Iconv:
# encoding: UTF-8
def russian_translit(text)
  # Letters that map to a single Latin character (lowercase, then uppercase).
  translited = text.tr('абвгдеёзийклмнопрстуфхэыь', 'abvgdeezijklmnoprstufhey\'')
  translited = translited.tr('АБВГДЕЁЗИЙКЛМНОПРСТУФХЭЫЬ', 'ABVGDEEZIJKLMNOPRSTUFHEY\'')
  # Letters that map to several Latin characters, or to nothing.
  translited.gsub(/[жцчшщъюяЖЦЧШЩЪЮЯ]/,
    'ж' => 'zh', 'ц' => 'ts', 'ч' => 'ch', 'ш' => 'sh', 'щ' => 'sch', 'ъ' => '', 'ю' => 'ju', 'я' => 'ja',
    'Ж' => 'ZH', 'Ц' => 'TS', 'Ч' => 'CH', 'Ш' => 'SH', 'Щ' => 'SCH', 'Ъ' => '', 'Ю' => 'JU', 'Я' => 'JA')
end
p russian_translit("В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!")
#=> "V chaschah juga zhil by tsitrus? Da, no fal'shivyj ekzempljar!"
