Comparing bytes in Ruby

I have a binary blob header of either a JPG or MP4 file. I am trying to differentiate between the two.
When the file is a JPG, the first two bytes are \xFF\xD8. However, the comparison blob[0] == "\xFF" fails, even though I know that blob[0] is in fact \xFF.
What is the best way to do this?

This is an encoding issue. You are comparing a string with binary encoding (your JPEG blob) with a UTF-8 encoded string ("\xFF"):
foo = "\xFF".force_encoding("BINARY") # like your blob
bar = "\xFF"
p foo # => "\xFF"
p bar # => "\xFF"
p foo == bar # => false
There are several ways to create a binary encoded string:
str = "\xFF\xD8".b # => "\xFF\xD8" (Ruby 2.x)
str.encoding # => #<Encoding:ASCII-8BIT>
str = "\xFF\xD8".force_encoding("BINARY") # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
str = 0xFF.chr + 0xD8.chr # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
str = ["FFD8"].pack("H*") # => "\xFF\xD8"
str.encoding # => #<Encoding:ASCII-8BIT>
All of the above can be compared with your blob.
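To tie this back to the original JPG-versus-MP4 question, here is a minimal sketch rather than a definitive detector. The helper names sniff_kind and jpeg_bytes? are hypothetical, and the MP4 check rests on the assumption that bytes 4 to 7 of the header hold the box type "ftyp"; only the \xFF\xD8 JPEG prefix comes from the question itself.

```ruby
# Hypothetical helper, not from the original answer: classify a header blob
# by its leading bytes, whatever encoding the string is tagged with.
def sniff_kind(blob)
  header = blob.b # binary copy; the caller's string is left untouched
  return :jpeg if header.start_with?("\xFF\xD8".b)
  # Assumption: an MP4 header carries the box type "ftyp" at bytes 4..7.
  return :mp4 if header.byteslice(4, 4) == "ftyp".b
  :unknown
end

# Byte-level alternative: integers are compared, so encodings never matter.
def jpeg_bytes?(blob)
  blob.getbyte(0) == 0xFF && blob.getbyte(1) == 0xD8
end

p sniff_kind("\xFF\xD8\xFF\xE0".b)          # :jpeg
p sniff_kind("\x00\x00\x00\x18ftypmp42".b)  # :mp4
p jpeg_bytes?("\xFF\xD8".b)                 # true
```

Comparing integers with getbyte avoids the encoding question altogether, while the .b version keeps the magic numbers readable as strings.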

Related

Hash#compare_by_identity with string literals

I'm running Ruby 2.2.1.
The following code runs as expected as string hash keys are duped and frozen:
f = 'foo'
h = {f => 'bar'}
h.compare_by_identity
h[f] # => nil
h['foo'] # => nil
h[h.keys.first] # => "bar"
But I can't for the life of me figure out what is going on here:
h = {'foo' => 'bar'}
h.compare_by_identity
h.keys.first.frozen? # => true
'foo'.frozen? # => false
h.keys.first.object_id # => 20421220
'foo'.object_id # => 20067280
h['foo'] # => "bar"
h['foo'.dup] # => nil
It's interesting to note that the docs for #compare_by_identity started using #dup at 2.2.0, so it seems this behavior change is known.
2.1.7:
h1["a"] #=> nil # different objects.
2.2.0:
h1["a".dup] #=> nil # different objects.
However, the source is the same.
The same does not happen with other literals like arrays. Any ideas on why this behavior changed for string literals? The docs give no hints as to why.
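Since the handling of string literals evidently differs between Ruby versions, identity lookups are easier to reason about when the key objects are held in variables. The following is a minimal sketch under that assumption; it does not explain the literal behavior above, it only shows a setup that avoids depending on it. The variable names are illustrative.

```ruby
# Switch to identity comparison before any key is stored, then keep a
# reference to the exact key object.
h = {}.compare_by_identity
key = 'foo'.dup               # an unfrozen String we hold on to
h[key] = 'bar'

same_object   = h[key]        # lookup with the very same object
equal_content = h[key.dup]    # equal characters, different object

p same_object                 # "bar"
p equal_content               # nil
p h.keys.first.equal?(key)    # true
```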

Ruby encode UTF-8 string to UTF-16

I want to store the UTF-16 encoded form in another variable as a UTF-8 string.
1.9.3p194 :117 > str = "سلام"
=> "سلام"
1.9.3p194 :118 > enc = str.encode("utf-16")
=> "\uFEFF\u0633\u0644\u0627\u0645"
1.9.3p194 :119 > puts enc
??3D'E
=> nil
I want to store \uFEFF\u0633\u0644\u0627\u0645 (not ??3D'E) in a UTF-8 string so that I can concatenate it with other UTF-8 strings.
Use String#inspect:
str = "سلام"
# => "سلام"
enc = str.encode("utf-16")
# => "\uFEFF\u0633\u0644\u0627\u0645"
puts enc
# output: ▒▒3D'E
# => nil
puts enc.inspect
# output: "\uFEFF\u0633\u0644\u0627\u0645"
# => nil
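If the goal is a plain UTF-8 string holding the escape sequences themselves, without the surrounding quotes that inspect adds, one possible sketch is to build the escapes from the code points. The helper name unicode_escapes is hypothetical. It assumes a valid UTF-8 input whose characters all fit in four hex digits, and it does not reproduce the \uFEFF byte order mark that encode("utf-16") prepends.

```ruby
# Hypothetical helper: spell out each code point as a literal \uXXXX escape.
# Assumes valid UTF-8 input limited to the Basic Multilingual Plane.
def unicode_escapes(text)
  # Single quotes keep the backslash literal, so the result contains "\u".
  text.codepoints.map { |cp| format('\u%04X', cp) }.join
end

str = "\u0633\u0644\u0627\u0645"   # the word from the question
escaped = unicode_escapes(str)
puts escaped                        # prints \u0633\u0644\u0627\u0645
puts "escaped form: " + escaped     # concatenates like any other string
```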

Trim a trailing .0

I have an Excel column containing part numbers. Here is a sample
As you can see, it can be many different datatypes: Float, Int, and String. I am using roo gem to read the file. The problem is that roo interprets integer cells as Float, adding a trailing zero to them (16431 => 16431.0). I want to trim this trailing zero. I cannot use to_i because it will trim all the trailing numbers of the cells that require a decimal in them (the first row in the above example) and will cut everything after a string char in the String rows (the last row in the above example).
Currently, I have a method that checks the last two characters of the cell and trims them if they are ".0":
def trim(row)
  if row[0].to_s[-2..-1] == ".0"
    row[0] = row[0].to_s[0..-3]
  end
end
This works, but it feels terrible and hacky. What is the proper way of getting my Excel file contents into a Ruby data structure?
def trim num
  i, f = num.to_i, num.to_f
  i == f ? i : f
end
trim(2.5) # => 2.5
trim(23) # => 23
or, from string:
def convert x
  Float(x) # raises ArgumentError unless x is numeric
  i, f = x.to_i, x.to_f
  i == f ? i : f
rescue ArgumentError
  x
end
convert("fjf") # => "fjf"
convert("2.5") # => 2.5
convert("23") # => 23
convert("2.0") # => 2
convert("1.00") # => 1
convert("1.10") # => 1.1
For those using Rails, ActionView has the number_with_precision method that takes a strip_insignificant_zeros: true argument to handle this.
number_with_precision(13.00, precision: 2, strip_insignificant_zeros: true)
# => 13
number_with_precision(13.25, precision: 2, strip_insignificant_zeros: true)
# => 13.25
See the number_with_precision documentation for more information.
This should cover your needs in most cases: some_value.gsub(/(\.)0+$/, '').
It removes a decimal point that is followed only by zeroes, together with those zeroes. Otherwise, it leaves the string alone, so a value such as "123.560" is returned unchanged.
It's also very performant, as it is entirely string-based, requiring no floating point or integer conversions, assuming your input value is already a string:
Loading development environment (Rails 3.2.19)
irb(main):001:0> '123.0'.gsub(/(\.)0+$/, '')
=> "123"
irb(main):002:0> '123.000'.gsub(/(\.)0+$/, '')
=> "123"
irb(main):003:0> '123.560'.gsub(/(\.)0+$/, '')
=> "123.560"
irb(main):004:0> '123.'.gsub(/(\.)0+$/, '')
=> "123."
irb(main):005:0> '123'.gsub(/(\.)0+$/, '')
=> "123"
irb(main):006:0> '100'.gsub(/(\.)0+$/, '')
=> "100"
irb(main):007:0> '127.0.0.1'.gsub(/(\.)0+$/, '')
=> "127.0.0.1"
irb(main):008:0> '123xzy45'.gsub(/(\.)0+$/, '')
=> "123xzy45"
irb(main):009:0> '123xzy45.0'.gsub(/(\.)0+$/, '')
=> "123xzy45"
irb(main):010:0> 'Bobby McGee'.gsub(/(\.)0+$/, '')
=> "Bobby McGee"
irb(main):011:0>
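If insignificant zeros after other decimals should go as well ("123.560" becoming "123.56"), a two-step variant is one option. This is a sketch, not the one-liner from the answer above: strip_trailing_zeros is a hypothetical name, and unlike the original pattern it also drops a dangling decimal point, so "123." becomes "123".

```ruby
# Hypothetical variant of the regex approach shown above.
def strip_trailing_zeros(text)
  # Step 1: drop zeros that end a decimal part, keeping any earlier digits.
  trimmed = text.sub(/(\.\d*?)0+\z/, '\1')
  # Step 2: drop a decimal point left with nothing after it.
  trimmed.sub(/\.\z/, '')
end

p strip_trailing_zeros("123.560")    # "123.56"
p strip_trailing_zeros("123.0")      # "123"
p strip_trailing_zeros("100")        # "100"
p strip_trailing_zeros("127.0.0.1")  # "127.0.0.1"
```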
Numeric values are returned as type :float
def convert_cell(cell)
  if cell.is_a?(Float)
    i = cell.to_i
    cell == i.to_f ? i : cell
  else
    cell
  end
end
convert_cell("foobar") # => "foobar"
convert_cell(123) # => 123
convert_cell(123.4) # => 123.4
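The same idea can be applied to a whole sheet once it is in memory. The sketch below uses made-up rows as a stand-in for spreadsheet data and makes no roo calls. The helper name integral_float_to_int is hypothetical, and the finite? guard is an addition that keeps NaN or infinite values from raising in to_i.

```ruby
# Hypothetical helper: turn whole-number Floats into Integers and leave
# every other cell (Strings, nil, fractional Floats) untouched.
def integral_float_to_int(cell)
  return cell unless cell.is_a?(Float)
  cell.finite? && cell == cell.to_i ? cell.to_i : cell
end

# Made-up rows standing in for cells read from a spreadsheet.
rows = [[16431.0, "widget"], [2.5, "A-100"], ["X12.0", nil]]
cleaned = rows.map { |row| row.map { |cell| integral_float_to_int(cell) } }
p cleaned # [[16431, "widget"], [2.5, "A-100"], ["X12.0", nil]]
```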

How can I check a word is already all uppercase?

I want to be able to check if a word is already all uppercase. And it might also include numbers.
Example:
GO234 => yes
Go234 => no
You can compare the string with the same string but in uppercase:
'go234' == 'go234'.upcase #=> false
'GO234' == 'GO234'.upcase #=> true
a = "Go234"
a.match(/\p{Lower}/) # => #<MatchData "o">
b = "GO234"
b.match(/\p{Lower}/) # => nil
c = "123"
c.match(/\p{Lower}/) # => nil
d = "µ"
d.match(/\p{Lower}/) # => #<MatchData "µ">
So when the match result is nil, it is in uppercase already, else something is in lowercase.
Thanks to @mu is too short for pointing out that /\p{Lower}/ should be used instead, so that non-English lowercase letters are matched as well.
I am using the solution by @PeterWong and it works great as long as the string you're checking against doesn't contain any special characters (as pointed out in the comments).
However if you want to use it for strings like "Überall", just add this slight modification:
utf_pattern = Regexp.new("\\p{Lower}".force_encoding("UTF-8"))
a = "Go234"
a.match(utf_pattern) # => #<MatchData "o">
b = "GO234"
b.match(utf_pattern) # => nil
b = "ÜÖ234"
b.match(utf_pattern) # => nil
b = "Über234"
b.match(utf_pattern) # => #<MatchData "b">
Have fun!
You could either compare the string and string.upcase for equality (as shown by JCorc..)
irb(main):007:0> str = "Go234"
=> "Go234"
irb(main):008:0> str == str.upcase
=> false
OR
you could call arg.upcase! and check for nil. (But this will modify the original argument, so you may have to create a copy)
irb(main):001:0> "GO234".upcase!
=> nil
irb(main):002:0> "Go234".upcase!
=> "GO234"
Update: If you want this to work for Unicode (multi-byte) characters on Ruby versions before 2.4, String#upcase won't work, because it only converts ASCII letters there. In that case you'd need the unicode-util gem mentioned in this SO question.
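The regex approach from the answers above can be wrapped in a pair of small predicates. This is a sketch with hypothetical method names. The stricter variant additionally requires at least one uppercase letter, which is an assumption about what "all uppercase" should mean for a digits-only string such as "123".

```ruby
# True when the word contains no lowercase letter at all.
def no_lowercase?(word)
  (word =~ /\p{Lower}/).nil?
end

# Stricter: no lowercase letters and at least one uppercase letter.
def all_uppercase?(word)
  no_lowercase?(word) && !(word =~ /\p{Upper}/).nil?
end

p no_lowercase?("GO234")   # true
p no_lowercase?("Go234")   # false
p no_lowercase?("123")     # true
p all_uppercase?("123")    # false
```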

Why are two strings with same bytes and encoding not identical in Ruby 1.9?

In Ruby 1.9.2, I found a way to make two strings that have the same bytes, same encoding, and are equal, but they have a different length and different characters returned by [].
Is this a bug? If it is not a bug, then I'd like to fully understand it. What kind of information is stored inside Ruby 1.9.2 String objects that allows these two strings to behave differently?
Below is the code that reproduces this behavior. The comments that start with #=> show you what output I am getting from this script, and the parenthetical words tell you my judgment of that output.
#!/usr/bin/ruby1.9
# coding: utf-8
string1 = "\xC2\xA2" # A well-behaved string with one character (¢)
string2 = "".concat(0xA2) # A bizarre string very similar to string1.
p string1.bytes.to_a #=> [194, 162] (good)
p string2.bytes.to_a #=> [194, 162] (good)
puts string1.encoding.name #=> UTF-8 (good)
puts string2.encoding.name #=> UTF-8 (good)
puts string1 == string2 #=> true (good)
puts string1.length #=> 1 (good)
puts string2.length #=> 2 (weird!)
p string1[0] #=> "¢" (good)
p string2[0] #=> "\xC2" (weird!)
I am running Ubuntu and compiled Ruby from source. My Ruby version is:
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
It is a bug in Ruby, and it was fixed in r29848.
Matz mentioned this question via Twitter:
http://twitter.com/matz_translator/status/6597021662187520
http://twitter.com/matz_translator/status/6597055132733440
"It's hard to determine as a bug but, it's not acceptable to leave it as is. I'd prefer to fix this issue."
I think the problem is in the string's encoding. Check out James Gray's Shades of Gray: Ruby 1.9's String article on Unicode encoding.
Additional odd behavior:
# coding: utf-8
string1 = "\xC2\xA2"
string2 = "".concat(0xA2)
string3 = 0xC2.chr + 0xA2.chr
string1.bytes.to_a # => [194, 162]
string2.bytes.to_a # => [194, 162]
string3.bytes.to_a # => [194, 162]
string1.encoding.name # => "UTF-8"
string2.encoding.name # => "UTF-8"
string3.encoding.name # => "ASCII-8BIT"
string1 == string2 # => true
string1 == string3 # => false
string2 == string3 # => true
string1.length # => 1
string2.length # => 2
string3.length # => 2
string1[0] # => "¢"
string2[0] # => "\xC2"
string3[0] # => "\xC2"
string3.unpack('C*') # => [194, 162]
string4 = string3.unpack('C*').pack('C*') # => "\xC2\xA2"
string4.encoding.name # => "ASCII-8BIT"
string4.force_encoding('UTF-8') # => "¢"
string3.force_encoding('UTF-8') # => "¢"
string3.encoding.name # => "UTF-8"
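When two strings look alike but behave differently, it can help to print their bytes, encoding, length and validity side by side. The sketch below uses a hypothetical describe helper for that. It builds the binary string with pack rather than with the concat call from the question, so it does not depend on the bug discussed above.

```ruby
# Hypothetical diagnostic helper: the facts that decide how a String behaves.
def describe(str)
  { bytes: str.bytes, encoding: str.encoding,
    length: str.length, valid: str.valid_encoding? }
end

raw  = [0xC2, 0xA2].pack("C*")           # two bytes, tagged as binary
text = raw.dup.force_encoding("UTF-8")   # same bytes, read as UTF-8

p describe(raw)[:length]     # 2, each byte counts as one character
p describe(text)[:length]    # 1, the two bytes form a single character
p raw.bytes == text.bytes    # true
p raw == text                # false, the encodings are not comparable
```

force_encoding only changes the tag on the string, never the bytes, which is why the byte arrays stay equal while length and == disagree.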
