Difference between Encoding::BINARY and Encoding::ASCII-8BIT?

Ruby says that Encoding::BINARY and Encoding::ASCII-8BIT are the same.
Encoding::BINARY == Encoding::ASCII_8BIT
#=> true
We explicitly create a binary string and ruby still says it's ASCII_8BIT.
String.new("ABC", encoding: Encoding::BINARY).encoding
#=> #<Encoding:ASCII-8BIT>
Likewise, force_encoding cannot create a BINARY, it just creates an ASCII-8BIT string.
It seems that BINARY is simply an alias for ASCII-8BIT. Are there any differences?

Your observation is correct: BINARY is an alias for ASCII-8BIT. An alias is just another name for the same encoding (or method, etc.), so there are no differences.
Looking at the source code is the most reliable way to confirm this. CRuby's character encodings can be found in the enc directory. The ASCII-8BIT encoding is defined in the ascii.c file containing the following line (in 2.5.0, it's line 61):
ENC_ALIAS("BINARY", "ASCII-8BIT")
ENC_ALIAS works like Ruby's alias keyword (alias, original name).
Confirming that BINARY or another encoding name is an alias can be done in pure Ruby too. One possibility is calling the Encoding.aliases method which returns a hash (alias => original):
Encoding.aliases['BINARY'] # => "ASCII-8BIT"
Other useful methods are Encoding#name which returns the original name and Encoding#names which also returns all aliases:
Encoding::BINARY.names # => ["ASCII-8BIT", "BINARY"]
Encoding::US_ASCII.names # => ["US-ASCII", "ASCII", "ANSI_X3.4-1968", "646"]
Or a way without any Encoding methods:
Encoding::BINARY.equal?(Encoding::ASCII_8BIT)
As the == method is often overridden and may return true even if the operands are two different objects, BasicObject#equal? should be called to check whether they are the same object. E.g. 1 and 1.0 have the same value (== returns true) but not the same object identity (equal? returns false).
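The identity check can also be done from encoding names. This is only a minimal sketch: same_encoding? is a hypothetical helper name, not part of Ruby. It relies on Encoding.find, which resolves an alias to the underlying Encoding object.

```ruby
# Hypothetical helper: true when both names resolve to the very same
# Encoding object, i.e. one name is an alias of the other.
def same_encoding?(name_a, name_b)
  Encoding.find(name_a).equal?(Encoding.find(name_b))
end

p same_encoding?("BINARY", "ASCII-8BIT")  # => true
p same_encoding?("UTF-8", "US-ASCII")     # => false

# Value equality versus object identity, as described above.
p 1 == 1.0       # => true
p 1.equal?(1.0)  # => false
```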

Related

What does the "-" mean in front of a ruby symbol?

When I had a look into the ActiveRecord source today, I stumbled upon these lines
name = -name.to_s
https://github.com/rails/rails/blob/2459c20afb508c987347f52148210d874a9af4fa/activerecord/lib/active_record/reflection.rb#L24
and
ar.aggregate_reflections = ar.aggregate_reflections.merge(-name.to_s => reflection)
https://github.com/rails/rails/blob/2459c20afb508c987347f52148210d874a9af4fa/activerecord/lib/active_record/reflection.rb#L29
What purpose does the - operator serve on the symbol name?

That's String#-@:
Returns a frozen, possibly pre-existing copy of the string.
Example:
a = "foo"
b = "foo"
a.object_id #=> 6980
b.object_id #=> 7000
vs:
a = -"foo"
b = -"foo"
a.object_id #=> 6980
b.object_id #=> 6980

What purpose does the - operator serve on the symbol name?
You have your precedence rules wrong: the binary message sending operator (.) has higher precedence than everything else, which means - is not applied to the expression name but to the expression name.to_s.
In other words, you seem to think that this expression is parsed like this:
(-name).to_s
# which is the same as
name.-@().to_s()
but it is actually parsed as
-(name.to_s)
# which is the same as
name.to_s().-@()
Now, we don't know what name is, but unless someone is seriously messing with you, #to_s should return a String. In other words, the operator is not applied to a Symbol, as you thought.
Hence, we know that we are sending the message -@ to a String and can thus look up what String#-@ does in the documentation:
-string → frozen_string
Returns a frozen, possibly pre-existing copy of the string.
The returned String will be deduplicated as long as it does not have any instance variables set on it.
Dynamically created Strings are not frozen by default. Only static String literals are, depending on your setting of the magic comment # frozen_string_literal: true. String#-@ was added as a terse counterpart to String#freeze, allowing you to freeze and de-duplicate a String with as little syntactic noise as possible.
The opposite operation is also available as String#+@.
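A short sketch of both operators, assuming Ruby 2.5 or later (where the unary minus also de-duplicates). The helper interned_name and the sample symbol :title are made up for illustration and do not come from the Rails source.

```ruby
# Hypothetical helper mirroring the Rails line: the unary minus applies
# to the result of name.to_s, not to name itself.
def interned_name(name)
  -name.to_s
end

a = interned_name(:title)
b = interned_name(:title)

p a.frozen?    # => true
p a.equal?(b)  # => true: both calls return the same de-duplicated object

c = +a         # unary plus returns an unfrozen copy of a frozen string
p c.frozen?    # => false
p c == a       # => true
```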

Ruby: Testing a ruby string for a substring fails (substring is not recognized)

Using Ruby, I am trying to weed out spam messages the manual way, so why exactly does the below test return false when it should return true? The tested string is the original one, so you can literally copy/paste the whole thing into your ruby console to verify this example:
irb(main):053:0> "Веautiful women fоr sеx in yоur town АU: https://links.wtf/qLFs".include? "sex"
=> false
Hint: If you replace the word "sex" inside the entire string by typing it in yourself, the test will return true as expected. So, somehow, the two "sex" strings are not the same, but on what level? How to test that correctly?
EDIT:
I have narrowed it all down to this (copy/paste it to test it!):
irb(main):073:0> "е" == "e"
=> false

JavaScript's charCodeAt method tells me that the two characters have different Unicode values. Ruby's .ord method tells me the same thing. You could check against those Unicode values more literally in Ruby, but I'd recommend finding a way to normalize the data instead of adding endless conditionals for unusual characters. According to a Unicode lookup table I found online, the first character is U+0435 (decimal 1077), CYRILLIC SMALL LETTER IE (е).
Alternatively, here's one approach where you could just ban all Cyrillic characters. I used a full range of excluded characters so you could add exclusions as needed.
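#!/usr/bin/env ruby
# Decimal code points to reject; 1024 is U+0400, the start of the Cyrillic block.
CYRILLIC_UNICODE_DECIMALS = (1024..1273).to_a.freeze
ARGV.each do |arg|
  # Print every character whose code point is in the excluded list.
  arg.each_char do |char|
    p char if CYRILLIC_UNICODE_DECIMALS.include?(char.ord)
  end
end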
For reference, these are the .ord and .charCodeAt methods I used against your example. I started with JavaScript because it's a simple test in the browser console.
2.6.3 :005 > 'е'.ord
=> 1077
2.6.3 :006 > 'e'.ord
=> 101
'"е" == "e"'.charCodeAt(1)
1077
'"e" == "e"'.charCodeAt(1)
101
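As an alternative to listing code points by hand, Ruby regular expressions support Unicode script properties. The sketch below assumes Ruby 2.4+ (for String#match?). contains_cyrillic? is a hypothetical helper name, and the sample strings are shortened stand-ins for the spam text.

```ruby
# Hypothetical helper: true when the text contains at least one character
# from the Cyrillic script.
def contains_cyrillic?(text)
  text.match?(/\p{Cyrillic}/)
end

spam  = "s\u0435x"  # middle letter is U+0435 CYRILLIC SMALL LETTER IE
plain = "sex"       # all three letters are ASCII

p spam.include?(plain)       # => false
p contains_cyrillic?(spam)   # => true
p contains_cyrillic?(plain)  # => false
```

Whether mixed-script text should count as spam is a policy decision; the check only reports that Cyrillic characters are present.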

What's the difference between CGI.unescape and URI.decode_www_form_component?

These functions seem to do the same thing.
irb> CGI.unescape "Sloths%3A+Society+and+Habitat"
=> "Sloths: Society and Habitat"
irb> URI.decode_www_form_component "Sloths%3A+Society+and+Habitat"
=> "Sloths: Society and Habitat"
What's the difference?

These methods are very similar. They both accept a string and an encoding and return a string in the specified encoding with the % escapes decoded. But there are differences:
Invalid escapes
URI.decode_www_form_component raises an ArgumentError if the string contains invalid escape sequences.
URI.decode_www_form_component('%xz')
# ArgumentError: invalid %-encoding (%xz)
CGI.unescape simply ignores them.
CGI.unescape('%xz')
# "%xz"
Invalid encodings
CGI.unescape ignores your specified encoding if the result is invalid
p CGI.unescape("\u263a", 'ASCII')
# "☺"
URI.decode_www_form_component doesn't care
p URI.decode_www_form_component("\u263a", 'ASCII')
# "\xE2\x98\xBA"
Lastly (and I hesitate to even mention this), URI.decode_www_form_component is slightly faster because it uses a precomputed Hash to decode all 485 valid escape codes (it's case-sensitive), whereas CGI.unescape actually interprets the hex code and repacks it as a character.
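If the stricter behaviour is what you want, the ArgumentError can be turned into a simple validity check. A minimal sketch, where decode_form_component_or_nil is a hypothetical helper name:

```ruby
require 'uri'

# Hypothetical helper: decode a form component, returning nil instead of
# raising when the input contains a malformed % escape.
def decode_form_component_or_nil(str)
  URI.decode_www_form_component(str)
rescue ArgumentError
  nil
end

p decode_form_component_or_nil("Sloths%3A+Society+and+Habitat")  # => "Sloths: Society and Habitat"
p decode_form_component_or_nil("%xz")                            # => nil
```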

How to create a string with a "bad encoding" in ruby?

I have a file somewhere out in production that I do not have access to. When it is loaded by a ruby script, a regular expression against its contents fails with an ArgumentError => invalid byte sequence in UTF-8.
I believe I have a fix based on the answer with all the points here: ruby 1.9: invalid byte sequence in UTF-8
# Remove all invalid and undefined characters in the given string
# (ruby 1.9.3)
def safe_str str
  # edited based on matt's comment (thanks matt)
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end
However, I now want to build my rspec to verify that the code works. I don't have access to the file that caused the problem, so I want to create a string with the bad encoding programmatically.
I've tried variations on things like:
bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.length.should > safe_str(bad_str).length
or,
bad_str = (100..1000).to_a.pack('c*')
bad_str.length.should > safe_str(bad_str).length
but the length is always the same. I have also tried different character ranges; not always 100 to 1000.
Any suggestions on how to build a string with an invalid encoding within a ruby 1.9.3 script?

Lots of one-byte strings will make an invalid UTF-8 string, starting with 0x80. So 128.chr should work.

Your safe_str method will (currently) never actually do anything to the string, it is a no-op. The docs for String#encode on Ruby 1.9.3 say:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
This is true for the current release of 2.0.0 (patch level 247), however a recent commit to Ruby trunk changes this, and also introduces a scrub method that pretty much does what you want.
Until a new version of Ruby is released you will need to round trip your text string to another encoding and back to clean it, as in the second example in this answer to the question you linked to, something like:
def safe_str str
  s = str.encode('utf-16', 'utf-8', invalid: :replace, undef: :replace, replace: '')
  s.encode!('utf-8', 'utf-16')
end
Note that your first example of an attempt to create an invalid string won’t work:
bad_str = (100..1000).to_a.inject('') {|s,c| s << c; s}
bad_str.valid_encoding? # => true
From the << docs:
If the object is a Integer, it is considered as a codepoint, and is converted to a character before concatenation.
So you’ll always get a valid string.
Your second method, using pack, will create a string with the encoding ASCII-8BIT. If you then change this using force_encoding, you can create a UTF-8 string with an invalid encoding:
bad_str = (100..1000).to_a.pack('c*').force_encoding('utf-8')
bad_str.valid_encoding? # => false
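For a spec, a single stray byte is enough. The sketch below builds such a string from raw bytes and then relabels it as UTF-8. invalid_utf8 is a hypothetical helper name, and String#b assumes Ruby 2.0 or later.

```ruby
# Hypothetical helper: append the lone byte 0x80 to an ASCII prefix and
# tag the result as UTF-8. 0x80 is a continuation byte, so it can never
# start a UTF-8 character.
def invalid_utf8(prefix = "hi ")
  (prefix.b + 128.chr).force_encoding(Encoding::UTF_8)
end

bad = invalid_utf8
p bad.encoding         # => #<Encoding:UTF-8>
p bad.valid_encoding?  # => false
p bad.bytes.last       # => 128
```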

Try with:
s = "hi \255"
s.valid_encoding?
# => false

The following example can be used for testing purposes:
describe TestClass do
  let(:non_utf8_text) { "something\255 english." }
  it 'does not raise an error on an invalid byte sequence string' do
    expect(non_utf8_text).not_to be_valid_encoding
    expect { subject.call(non_utf8_text) }.not_to raise_error
  end
end
Thanks to Iwan B. for the "\255" advice.

In spec tests I’ve written, I haven’t found a way to fix this bad encoding:
Period%Basics
The %B string consistently produces ArgumentError: invalid byte sequence in UTF-8.

ruby 1.9, force_encoding, but check

I have a string I have read from some kind of input.
To the best of my knowledge, it is UTF8. Okay:
string.force_encoding("utf8")
But if this string has bytes in it that are not in fact legal UTF8, I want to know now and take action.
Ordinarily, will force_encoding("utf8") raise if it encounters such bytes? I believe it will not.
If I was doing an #encode I could choose from the handy options with what to do with characters that are invalid in the source encoding (or destination encoding).
But I'm not doing an #encode, I'm doing a #force_encoding. It has no such options.
Would it make sense to
string.force_encoding("utf8").encode("utf8")
to get an exception right away? Normally encoding from utf8 to utf8 doesn't make any sense. But maybe this is the way to get it to raise right away if there's invalid bytes? Or use the :replace option etc to do something different with invalid bytes?
But no, can't seem to make that work either.
Anyone know?
1.9.3-p0 :032 > a = "bad: \xc3\x28 okay".force_encoding("utf-8")
=> "bad: \xC3( okay"
1.9.3-p0 :033 > a.valid_encoding?
=> false
Okay, but how do I find and eliminate those bad bytes? Oddly, this does NOT raise:
1.9.3-p0 :035 > a.encode("utf-8")
=> "bad: \xC3( okay"
If I was converting to a different encoding, it would!
1.9.3-p0 :039 > a.encode("ISO-8859-1")
Encoding::InvalidByteSequenceError: "\xC3" followed by "(" on UTF-8
Or if I told it to, it'd replace it with a "?" =>
1.9.3-p0 :040 > a.encode("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"
So ruby's got the smarts to know what are bad bytes in utf-8, and to replace em with something else -- when converting to a different encoding. But I don't want to convert to a different encoding, i want to stay utf8 -- but I might want to raise if there's an invalid byte in there, or I might want to replace invalid bytes with replacement chars.
Isn't there some way to get ruby to do this?
Update: I believe this has finally been added to ruby in 2.1, with String#scrub present in the 2.1 preview release. So look for that!

(update: see https://github.com/jrochkind/scrub_rb)
So I coded up a solution to what I needed here: https://github.com/jrochkind/ensure_valid_encoding/blob/master/lib/ensure_valid_encoding.rb
But only much more recently did I realize this actually IS built into the stdlib, you just need to, somewhat counter-intuitively, pass 'binary' as the "source encoding":
a = "bad: \xc3\x28 okay".force_encoding("utf-8")
a.encode("utf-8", "binary", :undef => :replace)
=> "bad: �( okay"
Yep, that's exactly what I wanted. So turns out this IS built into 1.9 stdlib, it's just undocumented and few people know it (or maybe few people that speak English know it?). Although I saw these arguments used this way on a blog somewhere, so someone else knew it!
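One behaviour of this trick is worth checking before relying on it. With "binary" as the source encoding, every byte above 0x7F counts as undefined, so valid multi-byte characters appear to be replaced as well, not only the invalid bytes. A small sketch, where reencode_via_binary is a hypothetical wrapper name and the "café" sample is made up for illustration:

```ruby
# Hypothetical wrapper around the encode call shown above.
def reencode_via_binary(str)
  str.encode("utf-8", "binary", :undef => :replace)
end

bad   = "bad: \xC3\x28 okay"  # invalid UTF-8 from the example above
valid = "caf\u00E9"           # valid UTF-8; the last character is two bytes (C3 A9)

p reencode_via_binary(bad) == "bad: \uFFFD( okay"  # => true
p reencode_via_binary(valid) == "caf\uFFFD\uFFFD"  # => true: the valid character is lost too
p valid.valid_encoding?                            # => true
```

So the approach seems safe only for text that is otherwise plain ASCII.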

In ruby 2.1, the stdlib finally supports this with scrub.
http://ruby-doc.org/core-2.1.0/String.html#method-i-scrub
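With scrub available, the "raise or replace" behaviour asked for in the question can be sketched in a few lines. This assumes Ruby 2.1+. ensure_utf8 and its replace: keyword are hypothetical names chosen for this example.

```ruby
# Hypothetical helper. With no replacement given it raises on invalid
# bytes; otherwise it substitutes the replacement via String#scrub.
def ensure_utf8(str, replace: nil)
  s = str.dup.force_encoding(Encoding::UTF_8)
  return s if s.valid_encoding?
  raise ArgumentError, "invalid byte sequence in UTF-8" if replace.nil?
  s.scrub(replace)
end

a = "bad: \xC3\x28 okay"
p ensure_utf8(a, replace: "?")             # => "bad: ?( okay"
p ensure_utf8("caf\u00E9") == "caf\u00E9"  # => true

begin
  ensure_utf8(a)
rescue ArgumentError => e
  puts e.message                           # prints: invalid byte sequence in UTF-8
end
```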

Make sure that your script file itself is saved as UTF-8 and try the following:
# encoding: UTF-8
p [a = "bad: \xc3\x28 okay", a.valid_encoding?]
p [a.force_encoding("utf-8"), a.valid_encoding?]
p [a.encode!("ISO-8859-1", :invalid => :replace), a.valid_encoding?]
On my Windows 7 system this gives the following:
["bad: \xC3( okay", false]
["bad: \xC3( okay", false]
["bad: ?( okay", true]
So your bad char is replaced. You can do it right away as follows:
a = "bad: \xc3\x28 okay".encode!("ISO-8859-1", :invalid => :replace)
=> "bad: ?( okay"
EDIT: here is a solution that works on any arbitrary encoding. The first method re-encodes only the bad chars; the second just replaces them with a ?
def validate_encoding(str)
  str.chars.collect do |c|
    c.valid_encoding? ? c : c.encode!(Encoding.locale_charmap, :invalid => :replace)
  end.join
end
def validate_encoding2(str)
  str.chars.collect do |c|
    c.valid_encoding? ? c : '?'
  end.join
end
a = "bad: \xc3\x28 okay"
puts validate_encoding(a) #=>bad: ?( okay
puts validate_encoding(a).valid_encoding? #=>true
puts validate_encoding2(a) #=>bad: ?( okay
puts validate_encoding2(a).valid_encoding? #=>true

To check that a string has no invalid sequences, try to convert it to the binary encoding:
# Returns true if the string has only valid sequences
def valid_encoding?(string)
  string.encode('binary', :undef => :replace)
  true
rescue Encoding::InvalidByteSequenceError
  false
end
p valid_encoding?("\xc0".force_encoding('iso-8859-1')) # true
p valid_encoding?("\u1111") # true
p valid_encoding?("\xc0".force_encoding('utf-8')) # false
This code replaces undefined characters, because we don't care if there are valid sequences that cannot be represented in binary. We only care if there are invalid sequences.
A slight modification to this code returns the actual error, which has valuable information about the improper encoding:
# Returns the encoding error, or nil if there isn't one.
def encoding_error(string)
  string.encode('binary', :undef => :replace)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.to_s
end
# Returns truthy if the string has only valid sequences
def valid_encoding?(string)
  !encoding_error(string)
end
puts encoding_error("\xc0".force_encoding('iso-8859-1')) # nil
puts encoding_error("\u1111") # nil
puts encoding_error("\xc0".force_encoding('utf-8')) # "\xC0" on UTF-8
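On Ruby 2.1 or later, a different way to get at the offending bytes is to let String#scrub hand them to a block. This is a sketch of that alternative, not the approach used above. invalid_sequences is a hypothetical helper name.

```ruby
# Hypothetical helper: collect each invalid byte sequence (as an array of
# byte values) that String#scrub reports for the string's own encoding.
def invalid_sequences(str)
  found = []
  str.scrub { |bytes| found << bytes.bytes; "" }
  found
end

p invalid_sequences("bad: \xC3\x28 okay")  # => [[195]]
p invalid_sequences("\u1111")              # => []
```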

About the only thing I can think of is to transcode to something and back that won't damage the string in the round-trip:
string.force_encoding("UTF-8").encode("UTF-32LE").encode("UTF-8")
Seems rather wasteful, though.
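Wrapped in a method, that round trip behaves like the check-and-raise the question asks for. A minimal sketch, where utf8_or_raise is a hypothetical name:

```ruby
# Hypothetical helper: returns a UTF-8 copy of the string, or raises
# Encoding::InvalidByteSequenceError when the bytes are not valid UTF-8.
def utf8_or_raise(str)
  str.dup.force_encoding("UTF-8").encode("UTF-32LE").encode("UTF-8")
end

p utf8_or_raise("caf\u00E9") == "caf\u00E9"  # => true

begin
  utf8_or_raise("bad: \xC3\x28 okay")
rescue Encoding::InvalidByteSequenceError => e
  puts e.class  # prints: Encoding::InvalidByteSequenceError
end
```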

Okay, here's a really lame pure ruby way to do it that I figured out myself. It probably performs for crap. What the heck, ruby? Not selecting my own answer for now, hoping someone else will show up and give us something better.
# Pass in a string, will raise an Encoding::InvalidByteSequenceError
# if it contains an invalid byte for its encoding; otherwise
# returns an equivalent string.
#
# OR, like String#encode, pass in option `:invalid => :replace`
# to replace invalid bytes with a replacement string in the
# returned string. Pass in the
# char you'd like with option `:replace`, or will, like String#encode
# use the unicode replacement char if it thinks it's a unicode encoding,
# else ascii '?'.
#
# in any case, method will raise, or return a new string
# that is #valid_encoding?
def validate_encoding(str, options = {})
  str.chars.collect do |c|
    if c.valid_encoding?
      c
    else
      unless options[:invalid] == :replace
        # it ought to be filled out with all the metadata
        # this exception usually has, but what a pain!
        raise Encoding::InvalidByteSequenceError.new
      else
        options[:replace] || (
          # surely there's a better way to tell if
          # an encoding is a 'Unicode encoding form'
          # than this? What's wrong with you ruby 1.9?
          str.encoding.name.start_with?('UTF') ?
            "\uFFFD" :
            "?" )
      end
    end
  end.join
end
More ranting at http://bibwild.wordpress.com/2012/04/17/checkingfixing-bad-bytes-in-ruby-1-9-char-encoding/

If you are doing this for a "real-life" use case, for example parsing different strings entered by users, and not just for the sake of being able to "decode" a totally random file which could be made of as many encodings as you wish, then I guess you could at least assume that all characters in each string have the same encoding.
Then, in this case, what would you think about this?
strings = [ "UTF-8 string with some utf8 chars \xC3\xB2 \xC3\x93",
            "ISO-8859-1 string with some iso-8859-1 chars \xE0 \xE8", "..." ]
strings.each { |s|
  s.force_encoding "utf-8"
  if s.valid_encoding?
    next
  else
    while s.valid_encoding? == false
      s.force_encoding "ISO-8859-1"
      s.force_encoding "..."
    end
    s.encode!("utf-8")
  end
}
I am not a Ruby "pro" in any way, so please forgive me if my solution is wrong or even a bit naive.
I just try to give back what I can, and this is what I've come to while working on a little parser for arbitrarily encoded strings, which I am doing for a study project.
I must admit that I've not even fully tested it. I just got a couple of "positive" results, but I was so excited about possibly having found what I was struggling to find (after all the time I spent reading about this on SO) that I felt the need to share it as quickly as possible, hoping that it could save some time for anyone who has been looking for this for as long as I have... if it works as expected :)

A simple way to provoke an exception seems to be:
untrusted_string.match /./
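That exception can be turned into a boolean check, although String#valid_encoding? gives the same answer more directly. A minimal sketch, where regexp_safe? is a hypothetical helper name:

```ruby
# Hypothetical helper: true when the string can be matched against a
# regular expression without raising ArgumentError for invalid bytes.
def regexp_safe?(untrusted_string)
  untrusted_string.match(/./)
  true
rescue ArgumentError
  false
end

p regexp_safe?("okay")                # => true
p regexp_safe?("bad: \xC3\x28 okay")  # => false
```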

Here are 2 common situations and how to deal with them in Ruby 2.1+. I know, the question refers to Ruby v1.9, but maybe this is helpful for others finding this question via Google.
Situation 1
You have a UTF-8 string with possibly a few invalid bytes
Remove the invalid bytes:
str = "Partly valid\xE4 UTF-8 encoding: äöüß"
str.scrub('')
# => "Partly valid UTF-8 encoding: äöüß"
Situation 2
You have a string that could be in either UTF-8 or ISO-8859-1 encoding
Check which encoding it is and convert to UTF-8 (if necessary):
str = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
unless str.valid_encoding?
  str.encode!( 'UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '?' )
end #unless
# => "String in ISO-8859-1 encoding: äöüß"
Notes
The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Even though this is almost always the case, you can make sure of it by starting your scripts with # encoding: UTF-8.
It is programmatically possible to detect invalid byte sequences in most multi-byte encodings like UTF-8 (in Ruby, see: #valid_encoding?). However, it is NOT (easily) possible to programmatically detect invalidity of single-byte encodings like ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. detecting if a String is valid ISO-8859-1 encoding.
Even though UTF-8 has become increasingly popular as the default encoding on the web, ISO-8859-1 and other Latin1 flavors are still very popular in the Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to ISO-8859-1 but differ from it slightly. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15
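Situation 2 can be wrapped in a small helper. This is a sketch under the same assumption as above, namely that anything which is not valid UTF-8 is ISO-8859-1. to_utf8 is a hypothetical name, and the input is built with String#b so the example does not depend on the source file's encoding.

```ruby
# Hypothetical helper: return a UTF-8 string, treating input that is not
# valid UTF-8 as ISO-8859-1 (an assumption, since single-byte encodings
# cannot be told apart by validity checks).
def to_utf8(raw)
  s = raw.dup.force_encoding(Encoding::UTF_8)
  return s if s.valid_encoding?
  s.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1 = "\xE4\xF6\xFC\xDF".b  # the bytes of "äöüß" in ISO-8859-1

p to_utf8(latin1) == "\u00E4\u00F6\u00FC\u00DF"  # => true
p to_utf8("\u00E4\u00F6") == "\u00E4\u00F6"      # => true: already valid UTF-8
```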
