Ruby: hexadecimal in regular expressions - ruby

I need to match an md5 checksum in a regular expression in a Ruby (actually Rails) program. I found out somewhere that I can match hexadecimal strings with \h sequence, but I can't find the link anymore.
I'm using that sequence and my code is working in Ruby 1.9.2. I can make it working even under plain IRB (so it's not a Rails extension).
ruby-1.9.2-p180 :007 > "123abcdf" =~ /^\h+$/; $~
=> #<MatchData "123abcdf">
ruby-1.9.2-p180 :008 > "123abcdfg" =~ /^\h+$/; $~
=> nil
However my IDE mark that expression as wrong and I can't find any reference which cites that sequence.
Is the \h sequence legal in Ruby Regex under any environment/version or should I trust my ide and replace it with something like [abcdef\d]?

Yes it is. Check the official doc for the complete documentation for regex in Ruby.
Note that \h will match uppercase letters too, so it's actually equivalent to [a-fA-F\d]

According to this \h is part of oniguruma, which I believe is standard in ruby 1.9.

Related

Regular Expressions metacharacters, \< and \>, no longer supported in latest Ruby version?

When I investigated this in irb, I found that the metacharacters, \< and \>, returns nil when I expected a value. Under the cheatsheet I'm using, these metacharacters are called "start-of-word" and "end-of-word" respectively. But don't they function the same as "word boundaries"?
It seems to hold true for the examples in "Mastering Regular Expression" by Friedl.
irb(main):001:0> "this cat is fat" =~ /\bcat\b/
=> 5
irb(main):002:0> "this cat is fat" =~ /\<cat\>/
=> nil
irb(main):003:0> "cat" =~ /\<cat\>/
=> nil
That's entirely possible. As of Ruby 1.9, Ruby switched to Oniguruma for regular expression parsing. It's possible that prior to 1.9, \< and > were valid.
However, in researching this, I found that they are listed as a specifically GNU addition to the regex language.
Playing with it in Rubular, which supports running a regex through several different ruby implementations, I couldn't get \< or > to work in any version. \b seems to be the more standard way of specifying a word boundary...
Ruby has always used \b for word boundaries, just like Perl and the other Perl-derived flavors (JavaScript, .NET, Python, etc.). It's only GNU tools like egrep that use \< and \>.
Is that the AddedBytes cheat sheet you're using? It contains several errors, that being one of them. Elsewhere it says < and > are metacharacters and you have to use \< and \> to match them literally; how's that for a Catch-22?

Converting UTF-8 characters into properly ASCII characters

I have the string "V\355ctor" (I think that's Víctor).
Is there a way to convert it to ASCII where í would be replaced by an ASCII i?
I already have tried Iconv without success.
(I'm only getting Iconv::IllegalSequence: "\355ctor")
Further, are there differences between Ruby 1.8.7 and Ruby 2.0?
EDIT:
Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "V\355ctor") this seems to work but the result is Vctor not Victor
I know of two options.
transliterate from the I18n gem.
$ irb
1.9.3-p448 :001 > string = "Víctor"
=> "Víctor"
1.9.3-p448 :002 > require 'i18n'
=> true
1.9.3-p448 :003 > I18n.transliterate(string)
=> "Victor"
Unidecoder from the stringex gem.
Stringex::Unidecoder..decode(string)
Update:
When running Unidecoder on "V\355ctor", you get the following error:
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with IBM437 string)
Hmm, maybe you want to first translate from IBM437:
string.force_encoding('IBM437').encode('UTF-8')
This may help you get further. Note that the autodetected encoding could be incorrect, if you know exactly what the encoding is, it would make everything a lot easier.
What you want to do is called transliteration.
The most used and best maintained library for this is ICU. (Iconv is frequently used too, but it has many limitations such as the one you ran into.)
A cursory Google search yields a few ruby ICU wrappers. I'm afraid I cannot comment on which one is better, since I've admittedly never used any of them. But that is the kind of stuff you want to be using.

Ruby Regex Different from Rubular

I'm trying the same super simple regex on Rubular.com and my VM Linux with Ruby 1.9.2 I dont' know why I'm getting different outputs:
VM:
my_str = "Madam Anita"
puts my_str[/\w/]
this Outputs: Madam
on Rubular it outputs: MadamAnita
Rubular:
http://www.rubular.com/r/qyQipItdes
I would love some help. I stuck here. I will not be able to test my code for the hw1.
No, it doesn't really. It matches all characters in "Madam" and "Anita", but not the space. The problem you are having is that my_str[/\w/] only returns a single match for the given regular expression, whereas Rubular highlights all possible matches.
If you need all occurrences, you could do this:
1.9.3p194 :002 > "Madam Anita".scan(/\w+/)
=> ["Madam", "", "Anita", ""]
Actually, \w matches a single character. The result in Rubular contains spaces between adjacent characters to tell you this (though I wish they'd also make the highlighting more obvious...). Compare with the output from matching \w+, which matches two strings (Madam and Anita).

ruby incorrect method behavior (possible depending charset)

I got weird behavior from ruby (in irb):
irb(main):002:0> pp "    LS 600"
"\302\240\302\240\302\240\302\240LS 600"
irb(main):003:0> pp "    LS 600".strip
"\302\240\302\240\302\240\302\240LS 600"
That means (for those, who don't understand) that strip method does not affect this string at all, same with gsub('/\s+/', '')
How can I strip that string (I got it while parsing Internet page)?
The string "\302\240" is a UTF-8 encoded string (C2 A0) for Unicode code point A0, which represents a non breaking space character. There are many other Unicode space characters. Unfortunately the String#strip method removes none of these.
If you use Ruby 1.9.2, then you can solve this in the following way:
# Ruby 1.9.2 only.
# Remove any whitespace-like characters from beginning/end.
"\302\240\302\240LS 600".gsub(/^\p{Space}+|\p{Space}+$/, "")
In Ruby 1.8.7 support for Unicode is not as good. You might be successful if you can depend on Rails's ActiveSupport::Multibyte. This has the advantage of getting a working strip method for free. Install ActiveSupport with gem install activesupport and then try this:
# Ruby 1.8.7/1.9.2.
$KCODE = "u"
require "rubygems"
require "active_support/core_ext/string/multibyte"
# Remove any whitespace-like characters from beginning/end.
"\302\240\302\240LS 600".mb_chars.strip.to_s

Why does "?b" mean 'b' in Ruby?

"foo"[0] = ?b # "boo"
I was looking at the above example and trying to figure out:
How "?b" implies the character 'b'?
Why is it necessary? - Couldn't I just write this:
"foo"[0] = 'b' # "boo"
Ed Swangren: ? returns the character code of a
character.
Not in Ruby 1.9. As of 1.9, ?a returns 'a'. See here: Son of 10 things to be aware of in Ruby 1.9!
telemachus ~ $ ~/bin/ruby -v
ruby 1.9.1p0 (2009-01-30 revision 21907) [i686-linux]
telemachus ~ $ ~/bin/ruby -e 'char = ?a; puts char'
a
telemachus ~ $ /usr/bin/ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
telemachus ~ $ /usr/bin/ruby -e 'char = ?a; puts char'
97
Edit: A very full description of changes in Ruby 1.9.
Another edit: note that you can now use 'a'.ord if you want the string to number conversion you get in 1.8 via ?a.
The change is related to Ruby 1.9's UTF-8 updates.
The Ruby 1.8 version of ? only worked with single-byte characters. In 1.9, they updated everything to work with multi-byte characters. The trouble is, it's not clear what integer should return from ?€.
They solved it by changing what it returns. In 1.9, all of the following are single-element strings and are equivalent:
?€
'€'
"€"
"\u20AC"
?\u20AC
They should have dropped the notation, IMO, rather than (somewhat randomly) changing the behavior. It's not even officially deprecated, though.
? returns the character code of a character. Here is a relevant post on this.
In some languages (Pascal, Python), chars don't exist: they're just length-1 strings.
In other languages (C, Lisp), chars exist and have distinct syntax, like 'x' or #\x.
Ruby has mostly been on the side of "chars don't exist", but at times has seemed to not be entirely sure of this choice. If you do want chars as a data type, Ruby already assigns meaning to '' and "", so ?x seems about as reasonable as any other option for char literals.
To me, it's simply a matter of saying what you mean. You could just as well say foo[0]=98, but you're using an integer when you really mean a character. Using a string when you mean a character looks equally strange to me: the set of operations they support is almost completely different. One is a sequence of the other. You wouldn't make Math.sqrt take a list of numbers, and just happen to only look at the first one. You wouldn't omit "integer" from a language just because you already support "list of integer".
(Actually, Lisp 1.0 did just that -- Church numerals for everything! -- but performance was abysmal, so this was one of the huge advances of Lisp 1.5 that made it usable as a real language, back in 1962.)

Resources