What is the difference between \d and \p{Digit}? - ruby

While I have been using \p{Alpha} and \p{Space} for quite some time in my regular expressions, I just came across \p{Digit}, but I couldn't find any information about its upsides or downsides compared to the \d that I normally use. What are the key differences between the two?

\d matches only ASCII digits, i.e. it is equivalent to the class [0-9]. \p{Digit} matches the same characters as \d plus any other Unicode character that represents a decimal digit. For example, to match the Arabic-Indic digit zero (code point U+0660):
"\u0660"
# => "٠"
"\u0660" =~ /\d/
# => nil
"\u0660" =~ /\p{Digit}/
# => 0
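As a rough illustration of the same difference on a longer string, the sketch below scans a string that mixes one ASCII digit with two non-ASCII decimal digits. The sample string and the variable names are made up for the example.

```ruby
# Hypothetical sample: ASCII "7", ARABIC-INDIC DIGIT ZERO (U+0660),
# and EXTENDED ARABIC-INDIC DIGIT TWO (U+06F2).
mixed = "7\u0660\u06F2"

ascii_only = mixed.scan(/\d/)          # only the ASCII digit
all_digits = mixed.scan(/\p{Digit}/)   # every Unicode decimal digit

puts ascii_only.length   # 1
puts all_digits.length   # 3
```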

Related

How do I specify in Ruby that I want to match a character provided that a sequence following that character does not match a pattern?

I'm using Ruby on Rails 5.1. In Ruby, how do I say that I want to match a string if the first character matches something but the sequence that follows does NOT match a pattern? That is, I want to match a number provided that the sequence that follows is not a character from an array I have, followed by two other numbers. Here's my character array ...
2.4.0 :010 > TOKENS
=> [":", ".", "'"]
So this string would NOT match
3:00
since ":00" matches the pattern of a character from my array followed by two numbers. But this string
3400
would match. This string would also match
3:0
and this would match
3
since nothing follows the above. How do I write the appropriate regex in Ruby?
string =~ /\A\d+(?!:\d{2})/
This regular expression means:
\A anchors the match to the start of the string.
\d+ means "one or more digits".
(?!...) is a negative look-ahead. It checks that the pattern contained in the brackets does not match, looking ahead from the current position.
:\d{2} means : followed by two digits.
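The pattern above only rules out a colon. A minimal sketch that covers every entry in TOKENS could build the look-ahead with Regexp.union, assuming (as the question describes) that only the first character has to be a digit. The names TOKEN_PATTERN, ALLOWED and allowed? are made up for this example.

```ruby
TOKENS = [":", ".", "'"].freeze

# Regexp.union escapes each string, so "." is matched literally.
TOKEN_PATTERN = Regexp.union(TOKENS)

# A digit at the start that is NOT followed by a token and two digits.
ALLOWED = /\A\d(?!#{TOKEN_PATTERN}\d{2})/

def allowed?(str)
  ALLOWED.match?(str)
end

puts allowed?("3:00")  # false
puts allowed?("3.00")  # false
puts allowed?("3400")  # true
puts allowed?("3:0")   # true
puts allowed?("3")     # true
```

String#match? and Regexp#match? need Ruby 2.4 or later, which matches the version shown in the question.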
Consideration should be given to testing the first character and the remaining characters separately.
def match_it?(str, first_char_regex, no_match_regex)
str[0].match?(first_char_regex) && !str[1..-1].match?(no_match_regex)
end
match_it?("0:00", /0/, /\A[:. ]cat\z/) #=> true
match_it?("0:00", /\d/, /\A[:. ]\d+\z/) #=> false
match_it?("0:00", /[[:alpha:]]/, /\A[:. ]\d+\z/) #=> false
I believe this reads well and it simplifies testing when compared to methods that employ a single regular expression.

Regex to find strings with only letters or numbers or both

I am searching for strings with only letters or numbers or both. How could I write a regex for that?
You can use the following regex to check whether the string contains only letters and/or numbers
^[a-zA-Z0-9]+$
Explanation
^: Start of line
[]: Character class
a-zA-Z: Matches any ASCII letter
0-9: Matches any ASCII digit
+: Matches the preceding class one or more times
$: End of line
"abc&#*(2743438" !~ /[^a-z0-9]/i # => false
"abc2743438" !~ /[^a-z0-9]/i # => true
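This approach avoids the line anchors ^ and $, which match at every line break inside the string and can therefore be a security risk when validating input. It's better to use \A and \z. If you really do want line anchors in a Rails format validation, you have to add the :multiline => true option.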
Only letters and numbers:
/\A[a-zA-Z0-9]+\z/
Or, if you also want to allow the - and _ characters:
/\A[a-zA-Z0-9_\-]+\z/
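To see why the choice of anchors matters, here is a small sketch with a made-up two-line input. ^ and $ are satisfied by the first line alone, while \A and \z have to span the whole string.

```ruby
# Hypothetical input: a valid first line followed by something unwanted.
input = "abc123\n<script>"

line_anchored   = input.match?(/^[a-zA-Z0-9]+$/)    # checks a single line
string_anchored = input.match?(/\A[a-zA-Z0-9]+\z/)  # checks the whole string

puts line_anchored    # true
puts string_anchored  # false
```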

Ruby specifying regexp

How would I write a regexp so that the string MUST equal the exact format in the regexp?
For example:
/\d:\d/ =~ 5:4
BUT
/\d:\d/ is also equal to 5:42alskjf2425
how do I make it so that my regexp checks for only a digit, followed by a colon, followed by a digit, and nothing else?
Thanks.
Use \A and \z anchors, to match the beginning and end of a string:
/\A\d:\d\z/ =~ '5:4' # => 0 (boolean true)
/\A\d:\d\z/ =~ '5:4x' # => nil (boolean false)
If you need to specify how many characters must be found, you can do it a couple ways:
\d finds one.
\d{1} finds one.
\d{1,2} finds one or two.
\d{1,} finds one or more.
\d{,2} finds zero, one or two.
In other words, use:
/\d{1}:\d{1}/
Check it out:
'5:4'[/\d{1}:\d{1}/] # => "5:4"
'5:42alskjf2425'[/\d{1}:\d{1}/] # => "5:4"
That's all documented so take the time to read through the Regexp documentation.
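Note that a quantifier such as {1} only limits how many digits are consumed. It does not stop the pattern from matching inside a longer string, as the second result above shows. A short sketch that combines the quantifier with the \A and \z anchors (the constant and method names are hypothetical):

```ruby
# Exactly one digit, a colon, exactly one digit, and nothing else.
EXACT_PAIR = /\A\d{1}:\d{1}\z/

def exact_pair?(str)
  EXACT_PAIR.match?(str)
end

puts exact_pair?("5:4")             # true
puts exact_pair?("5:42alskjf2425")  # false
puts exact_pair?("15:4")            # false
```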

How to detect this unprintable character using regex or other ways in Ruby?

I came upon a strange character (using Nokogiri).
irb(main):081:0> sss.dump
=> "\"\\u{a0}\""
irb(main):082:0> puts sss
=> nil
irb(main):083:0> sss
=> " "
irb(main):084:0> sss =~ /\s/
=> nil
irb(main):085:0> sss =~ /[[:print:]]/
=> 0
irb(main):087:0> sss == ' '
=> false
irb(main):088:0> sss.length
=> 1
Any idea what is this strange character?
When it's displayed in a webpage, it's a white space, but it doesn't match a whitespace \s
using regular expression. Ruby even thinks it's a printable character!
How do I detect characters like this and exclude them or flag them as whitespace (if possible)?
Thanks
It's the non-breaking space. In HTML, it's used pretty frequently and often written as &nbsp;. One way to find out the identity of a character like "\u{a0}" is to search the web for U+00A0 (using four or more hexadecimal digits) because that's how the Unicode specification notates Unicode code points.
The non-breaking space and other things like it are included in the regex /[[:space:]]/.
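A small sketch of how that could be used to flag or normalise such characters. The sample string is made up, and gsub is only one of several ways to do the replacement.

```ruby
nbsp = "\u00A0"                 # NO-BREAK SPACE

puts nbsp.ord.to_s(16)          # a0, the code point to look up
p(nbsp =~ /\s/)                 # nil: \s only covers ASCII whitespace
p(nbsp =~ /[[:space:]]/)        # 0: the POSIX class includes U+00A0

# Replace any kind of whitespace with a plain ASCII space.
scraped = "price:\u00A042"
cleaned = scraped.gsub(/[[:space:]]/, " ")
p cleaned                       # "price: 42"
```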

How to use double brackets in a regular expression?

What do double square brackets mean in a regex? I am confused about the following examples:
/[[^abc]]/
/[^abc]/
I was testing using Rubular, but I didn't see any difference between the one with double brackets and single brackets.
POSIX bracket expressions use a [:alpha:] notation and are used inside a character class, like:
/[[:alpha:][:digit:]]/
They are covered further down in the Ruby Regexp documentation. From the docs:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [:graph:], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
Ruby also supports the following non-POSIX character classes:
/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
/[[:ascii:]]/ - A character in the ASCII character set
# U+06F2 is "EXTENDED ARABIC-INDIC DIGIT TWO"
/[[:digit:]]/.match("\u06F2") #=> #<MatchData "\u{06F2}">
/[[:upper:]][[:lower:]]/.match("Hello") #=> #<MatchData "He">
/[[:xdigit:]][[:xdigit:]]/.match("A6") #=> #<MatchData "A6">
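The same contrast applies to letters. In this sketch (the sample word is made up), the ASCII range skips the accented letter while the POSIX bracket expression keeps it:

```ruby
# "naïve", written with U+00EF LATIN SMALL LETTER I WITH DIAERESIS.
word = "na\u00EFve"

ascii_letters = word.scan(/[a-zA-Z]/).join      # ASCII letters only
all_letters   = word.scan(/[[:alpha:]]/).join   # any alphabetic character

puts ascii_letters   # nave
puts all_letters     # naïve
```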
'[[' doesn't have any special meaning of its own. [xyz] is a character class and will match a single x, y or z. The caret ^ negates the class, so it matches any single character not listed in the brackets.
Removing ^ for simplicity: in Ruby a character class may contain another character class, so the inner [abc] is simply nested in the outer brackets and /[[abc]]/ describes the same set as /[abc]/. Neither bracket is matched literally; in both strings below it is the a that matches.
irb(main):032:0> /[[abc]]/ =~ "[a]"
=> 1
irb(main):033:0> /[[abc]]/ =~ "a]"
=> 0
The single-bracket version gives the same result on "a]", which is why you saw no difference in Rubular
irb(main):034:0> /[abc]/ =~ "a]"
=> 0
irb(main):034:0> /[abc]/ =~ "a"
=> 0
Both patterns find a match anywhere in the string. If the whole string has to be a single a, b or c, anchor the pattern.
irb(main):036:0> /^[abc]$/ =~ "a]"
=> nil
