def normalized?
matches = match(/[^A-Z]*/)
return matches.size == 0
end
This is my function, which operates on a string and checks whether it contains only uppercase letters. It correctly rules out non-matches, but when I call it on a string like "ABC" it reports no match, because matches.size is apparently 1 rather than zero. There seems to be an empty element in it.
Can anybody explain why?
Your regex is wrong - if you want it to match ONLY uppercase strings, use /^[A-Z]+$/.
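One caveat about the anchors in that pattern, offered as a sketch rather than as part of the answer above: in Ruby, ^ and $ anchor at line boundaries, not string boundaries, so a multi-line string can slip through. Using \A and \z avoids that. The method names below are hypothetical.

```ruby
# Hypothetical helpers comparing line anchors with string anchors.
def only_upper_line_anchored?(s)
  !!(s =~ /^[A-Z]+$/)    # ^ and $ match at the start and end of any line
end

def only_upper_string_anchored?(s)
  !!(s =~ /\A[A-Z]+\z/)  # \A and \z match only at the start and end of the string
end

puts only_upper_line_anchored?("ABC")          # true
puts only_upper_line_anchored?("ABC\nabc")     # true, the first line alone satisfies the pattern
puts only_upper_string_anchored?("ABC\nabc")   # false
```

If the input is known to be a single line without a trailing newline, the two versions behave the same.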
Your regular expression is incorrect. /[^A-Z]*/ means "match zero or more characters that are not between A and Z, anywhere in the string". The string ABC has zero characters that are not between A and Z, so it matches the regular expression.
Change your regular expression to /^[^A-Z]+$/. This means "match one or more characters that are not between A and Z, and make sure every character between the beginning and end of the string is not between A and Z". Then the string ABC will not match, and you can check matches[0].size or similar, as per sepp2k's answer.
MatchData#size returns the number of capturing groups in the regex plus one, so that md[i] will access a valid group iff i < md.size. So the value returned by size only depends on the regex, not the matched string, and will never be 0.
You want matches.to_s.size or matches[0].size.
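To make the point about MatchData#size concrete, here is a minimal sketch. The helper name match_sizes is hypothetical; it reports the group count next to the length of the matched text.

```ruby
# MatchData#size counts group 0 (the whole match) plus each capturing group,
# so it depends on the regex, not on how much text was matched.
def match_sizes(str, regex)
  md = str.match(regex)
  return nil if md.nil?
  { groups: md.size, matched_length: md[0].size }
end

p match_sizes("ABC", /[^A-Z]*/)  # groups is 1, matched_length is 0 (empty match)
p match_sizes("abc", /[^A-Z]*/)  # groups is 1, matched_length is 3
p match_sizes("ABC", /(A)(B)/)   # groups is 3, matched_length is 2
```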
ruby-1.9.2-p180> def normalized? s
ruby-1.9.2-p180?> s.match(/^[[:upper:]]+$/) ? true : false
ruby-1.9.2-p180?> end
=> nil
ruby-1.9.2-p180> normalized? "asdf"
=> false
ruby-1.9.2-p180> normalized? "ASDF"
=> true
The * in your regular expression means that it matches any number of non-uppercase characters, including zero. So it always matches everything. The fix is to remove the *, then it will fail to match a string containing only uppercase characters. (Although you would need a different test if zero-length strings are not permitted.)
If you want to know that the input string entirely consists of English uppercase letters, i.e. A-Z, then you must remove the Kleene Star as it will match before and after every single character in any input string (zero length match). The statement !s[/[^A-Z]/] tells you if there's no match of non-A-to-Z characters:
irb(main):001:0> def normalized? s
irb(main):002:1> return !s[/[^A-Z]/]
irb(main):003:1> end
=> nil
irb(main):004:0> normalized? "ABC"
=> true
irb(main):005:0> normalized? "AbC"
=> false
irb(main):006:0> normalized? ""
=> true
irb(main):007:0> normalized? "abc"
=> false
Only one regular expression matches a string made up solely of capitals, with at least one present:
def onlyupper(s)
(s =~ /^[A-Z]+$/) != nil
end
Truth table:
/[^A-Z]*/:
Testing 'asdf' matched 'asdf' length 4
Testing 'HHH' matched '' length 0
Testing '' matched '' length 0
Testing '-=AAA' matched '-=' length 2
--------
/[^A-Z]+/:
Testing 'asdf' matched 'asdf' length 4
Testing 'HHH' matched nil
Testing '' matched nil
Testing '-=AAA' matched '-=' length 2
--------
/^[^A-Z]*$/:
Testing 'asdf' matched 'asdf' length 4
Testing 'HHH' matched nil
Testing '' matched '' length 0
Testing '-=AAA' matched nil
--------
/^[^A-Z]+$/:
Testing 'asdf' matched 'asdf' length 4
Testing 'HHH' matched nil
Testing '' matched nil
Testing '-=AAA' matched nil
--------
/^[A-Z]*$/:
Testing 'asdf' matched nil
Testing 'HHH' matched 'HHH' length 3
Testing '' matched '' length 0
Testing '-=AAA' matched nil
--------
/^[A-Z]+$/:
Testing 'asdf' matched nil
Testing 'HHH' matched 'HHH' length 3
Testing '' matched nil
Testing '-=AAA' matched nil
--------
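The table above can be regenerated with a short script along these lines. This is a sketch, not necessarily the code the answer used; the names PATTERNS, SAMPLES and describe_match are hypothetical.

```ruby
PATTERNS = [/[^A-Z]*/, /[^A-Z]+/, /^[^A-Z]*$/, /^[^A-Z]+$/, /^[A-Z]*$/, /^[A-Z]+$/]
SAMPLES  = ['asdf', 'HHH', '', '-=AAA']

# Build one table row: the matched text and its length, or nil when nothing matches.
def describe_match(pattern, str)
  m = str[pattern]  # String#[] returns the first match or nil
  if m.nil?
    "Testing '#{str}' matched nil"
  else
    "Testing '#{str}' matched '#{m}' length #{m.length}"
  end
end

PATTERNS.each do |pattern|
  puts "#{pattern.inspect}:"
  SAMPLES.each { |s| puts describe_match(pattern, s) }
  puts '--------'
end
```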
This question needs a clearer answer. tchrist commented on this, and I wish he had posted it as an answer. The "Regex for matching capitals" is to use:
/\p{Uppercase}/
As tchrist mentions, this "is distinct from the general category \p{Uppercase_Letter} aka \p{Lu}. That’s because there exist non-Letters that count as Uppercase"
I'm trying to use the match method with an argument of a regex to select a valid phone number, by definition, any string with nine digits.
For example:
9347584987 is valid,
(456)322-3456 is valid,
(324)5688890 is valid.
But
(340)HelloWorld is NOT valid and
456748 is NOT valid.
So far, I'm able to use \d{9} to select the example string of 9 digit characters in a row, but I'm not sure how to specifically ignore any character, such as '-' or '(' or ')' in the middle of the sequence.
What kind of Regex could I use here?
Given:
nums=['9347584987','(456)322-3456','(324)5688890','(340)HelloWorld', '456748 is NOT valid']
You can split on a NON digit and rejoin to remove non digits:
> nums.map {|s| s.split(/\D/).join}
["9347584987", "4563223456", "3245688890", "340", "456748"]
Then filter on the length:
> nums.map {|s| s.split(/\D/).join}.select {|s| s.length==10}
["9347584987", "4563223456", "3245688890"]
Or, you can grab a group of numbers that look 'phony numbery' by using a regex to grab digits and common delimiters:
> nums.map {|s| s[/[\d\-()]+/]}
["9347584987", "(456)322-3456", "(324)5688890", "(340)", "456748"]
And then process that list as above.
That would delineate:
> '123 is NOT a valid area code for 456-7890'[/[\d\-()]+/]
=> "123" # no match
vs
> '123 is NOT a valid area code for 456-7890'.split(/\D/).join
=> "1234567890" # match
I suggest using one regular expression for each valid pattern rather than constructing a single regex. It would be easier to test and debug, and easier to maintain the code. If, for example, "123-456-7890" or "123-456-7890 x231" were in future deemed valid numbers, one need only add a single, simple regex for each to the array VALID_PATTERS below.
VALID_PATTERS = [/\A\d{10}\z/, /\A\(\d{3}\)\d{3}-\d{4}\z/, /\A\(\d{3}\)\d{7}\z/]
def valid?(str)
VALID_PATTERS.any? { |r| str.match?(r) }
end
ph_nbrs = %w| 9347584987 (456)322-3456 (324)5688890 (340)HelloWorld 456748 |
ph_nbrs.each { |s| puts "#{s.ljust(15)} \#=> #{valid?(s)}" }
9347584987 #=> true
(456)322-3456 #=> true
(324)5688890 #=> true
(340)HelloWorld #=> false
456748 #=> false
String#match? made its debut in Ruby v2.4. There are many alternatives, including str.match(r) and str =~ r.
"9347584987" =~ /(?:\d.*){9}/ #=> 0
"(456)322-3456" =~ /(?:\d.*){9}/ #=> 1
"(324)5688890" =~ /(?:\d.*){9}/ #=> 1
"(340)HelloWorld" =~ /(?:\d.*){9}/ #=> nil
"456748" =~ /(?:\d.*){9}/ #=> nil
Pattern:
^\(?\d{3}\)?\d{3}-?\d{4}$ # this makes the expected symbols optional
The following pattern ensures that an opening ( at the start of the string is followed by 3 digits and then a closing ):
^(\(\d{3}\)|\d{3})\d{3}-?\d{4}$
On principle, though, I agree with melpomene in advising that you remove all non-digit characters, test for 9 character length, then store/handle the phone numbers in a single/reliable/basic format.
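A minimal sketch of that strip-then-check approach follows. The method name normalize_phone and its expected_digits parameter are hypothetical. The default of ten digits is an assumption taken from the example numbers in the question, which all contain ten digits; pass a different count if your definition differs.

```ruby
# Strip everything except digits, then accept the result only if the
# digit count is the expected one. Returns the bare digits or nil.
def normalize_phone(str, expected_digits = 10)
  digits = str.delete('^0-9')  # '^0-9' means "every character that is not 0-9"
  digits.length == expected_digits ? digits : nil
end

%w[9347584987 (456)322-3456 (324)5688890 (340)HelloWorld 456748].each do |s|
  puts "#{s.ljust(15)} #{normalize_phone(s).inspect}"
end
```

Note that this accepts any string containing the expected number of digits, whatever else surrounds them, so it is looser than the per-pattern regexes shown earlier.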
I'm crawling data from a website, and these are the strings I get when I parse the HTML with Nokogiri:
"0:10\r\n (+1)\r\n "
"03:10\r\n (+1)\r\n "
How can I get only "0:10" and "03:10" ?
UPDATE
Also, what is the difference between match and gsub?
Thanks !
Your regular expression should only match strings that have the required pattern.
r = /
\A # match beginning of string
( # begin capture group 1
\d+ # match one or more digits
: # match a colon
\d{2} # match two digits
) # end capture group 1
\r\n\s+\(\+1\)\r\n\s+ # match substring
\z # match end of string
/x # free spacing regex definition mode
"0:10\r\n (+1)\r\n "[r,1]
#=> "0:10"
"03:10\r\n (+1)\r\n "[r,1]
#=> "03:10"
"0:101\r\n (+1)\r\n "[r,1]
#=> nil
":10\r\n (+1)\r\n "[r,1]
#=> nil
"0:10 \r\n (+1)\r\n "[r,1]
#=> nil
"0:10\r\n (+2)\r\n "[r,1]
#=> nil
"0:10\r\n (+1)\r\n cat"[r,1]
#=> nil
Depending on how the string may vary, some changes may be necessary to your pattern. For example, if "+1" in parentheses might be "+" followed by any positive number, you would need to replace \(\+1\) with \(\+\d+\).
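On the match versus gsub part of the question: match (like String#[] with a regex) extracts the text that matches, while gsub returns a copy of the string with every match replaced. A minimal sketch of both routes to the same result, using hypothetical helper names and looser patterns than the one above:

```ruby
# Extraction: return the first "digits, colon, two digits" substring, or nil.
def extract_time(str)
  str[/\d+:\d{2}/]
end

# Replacement: remove the "(+1)" marker together with the whitespace around it.
def strip_suffix(str)
  str.gsub(/\s*\(\+1\)\s*/, "")
end

raw = "03:10\r\n (+1)\r\n "
puts extract_time(raw)              # 03:10
puts strip_suffix(raw)              # 03:10
puts raw.match(/\d+:\d{2}/)[0]      # 03:10, match returns a MatchData object
```

Extraction is usually the safer choice here, since gsub leaves behind anything in the string that the pattern did not anticipate.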
You should use the regex /\d{0,2}:\d{0,2}/ that @engineer14 posted. It works, here's proof:
console.log("0:10\r\n (+1)\r\n ".match(/\d{0,2}:\d{0,2}/)[0])
console.log("03:10\r\n (+1)\r\n ".match(/\d{0,2}:\d{0,2}/)[0])
Explanation:
/ <-- open regex
\d <-- look for digit
{0,2} <-- between zero and two of them
: <-- look for a colon
\d <-- look for another digit
{0,2} <-- between zero and two of them
/ <-- close regex
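The snippet above is JavaScript, but the pattern behaves the same way in Ruby. Because {0,2} allows zero digits, the pattern also accepts a bare colon. The sketch below contrasts it with a stricter variant; the constant names and the stricter pattern are assumptions for illustration, not something proposed in the thread.

```ruby
LOOSE_TIME  = /\d{0,2}:\d{0,2}/  # zero to two digits on each side of the colon
STRICT_TIME = /\d{1,2}:\d{2}/    # assumption: one or two digits, colon, exactly two digits

raw = "0:10\r\n (+1)\r\n "

p raw[LOOSE_TIME]        # "0:10"
p raw[STRICT_TIME]       # "0:10"
p ":"[LOOSE_TIME]        # ":"  (no digits needed at all)
p ":"[STRICT_TIME]       # nil
p "123:456"[LOOSE_TIME]  # "23:45"
```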
What site are you crawling? The +1 might be important to consider if it is a timezone.
I'm doing a regex to check a slug.
Currently my regex is: /^[^-][a-z\-].*[^-]+$/
here's what I'm checking right now :
my-awesome-project => valid
-my-awesome-project => invalid
my-awesome-project- => invalid
Now what I want is to check if the dash is repeating or not :
my-awesome-project => should be valid
my-awesome--project => should not be valid
my----awesome-project => should not be valid
Can I do that with a regex ?
Thank you,
I think this regexp should work:
/^[a-z]+(-[a-z]+)*$/
What this does: ^[a-z]+ matches if the string begins with at least one lower-case letter. After that, (-[a-z]+)*$ allows zero or more occurrences of a dash followed by at least one lower-case letter, up to the end of the string.
See on Rubular.
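As a quick check, the pattern can be run against the slugs from the question. This is a sketch; the constant SLUG and the method valid_slug? are hypothetical names.

```ruby
SLUG = /^[a-z]+(-[a-z]+)*$/

# True when the string is lower-case words separated by single dashes.
def valid_slug?(str)
  str.match?(SLUG)  # String#match? requires Ruby 2.4 or later
end

%w[my-awesome-project -my-awesome-project my-awesome-project-
   my-awesome--project my----awesome-project].each do |s|
  puts "#{s.ljust(22)} #{valid_slug?(s)}"
end
```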
As I understand, the string is valid unless it:
contains a character other than a lower-case letter or hyphen,
begins with a hyphen,
ends with a hyphen, or
contains two (or more) hyphens in a row.
If that's the case, it's easiest to check whether it is invalid:
R = /
[^a-z-] # match one character other than a lower-case letter or hyphen
| # or
^- # match a hyphen as the first character
| # or
-$ # match a hyphen as the last character
| # or
-- # match two hyphens
/x
def valid?(str)
str !~ R
end
valid? 'my-awesome-project' #=> true
valid? '-my-awesome-project' #=> false
valid? 'my-awesome-project-' #=> false
valid? 'my-awesome--project' #=> false
valid? 'my----awesome-project' #=> false
Below regex may be helpful.
[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*
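As written, that pattern has no anchors, so it succeeds on any string that contains a valid slug somewhere inside it. A sketch of the difference, with hypothetical constant names and \A and \z added as an assumption about what was intended:

```ruby
UNANCHORED = /[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*/
ANCHORED   = /\A[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\z/  # must cover the whole string

bad = 'my-awesome--project'

p bad[UNANCHORED]                   # "my-awesome", a partial match still counts
p bad[ANCHORED]                     # nil
p 'my-awesome-project'[ANCHORED]    # "my-awesome-project"
```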
In Python I can do this:
import re
regex = re.compile('a')
regex.match('xay',1) # match because string starts with 'a' at 1
regex.match('xhay',1) # no match because character at 1 is 'h'
However in Ruby, the match method seems to match everything past the positional argument. For instance, /a/.match('xhay',1) will return a match, even though the match actually starts at 2. However, I want to only consider matches that start at a specific position.
How do I get a similar mechanism in Ruby? I would like to match patterns that start at a specific position in the string as I can in Python.
/^.{1}a/
for matching a at location x+1 in the string
/^.{x}a/
How about below using StringScanner ?
require 'strscan'
scanner = StringScanner.new 'xay'
scanner.pos = 1
!!scanner.scan(/a/) # => true
scanner = StringScanner.new 'xnnay'
scanner.pos = 1
!!scanner.scan(/a/) # => false
Regexp#match has an optional second parameter pos, but it works like Python's search method. You could however check if the returned MatchData begins at the specified position:
re = /a/
match_data = re.match('xay', 1)
match_data.begin(0) == 1
#=> true
match_data = re.match('xhay', 1)
match_data.begin(0) == 1
#=> false
match_data = re.match('áay', 1)
match_data.begin(0) == 1
#=> true
match_data = re.match('aay', 1)
match_data.begin(0) == 1
#=> true
Extending a little bit on what @sunbabaphu answered:
def matching_at_pos(x=1, regex)
/\A.{#{x-1}}#{regex}/
end # note the position is 1 indexed
'xxa' =~ matching_at_pos(2, /a/)
=> nil
'xxa' =~ matching_at_pos(3, /a/)
=> 0
'xxa' =~ matching_at_pos(4, /a/)
=> nil
The answer to this question is \G.
\G matches the starting point of the regex match, when calling the two-argument version of String#match that takes a starting position.
'xay'.match(/\Ga/, 1) # match because /a/ starts at 1
'xhay'.match(/\Ga/, 1) # no match because character at 1 is 'h'
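If the regex comes from elsewhere and cannot be written with \G up front, the anchor can be added programmatically. This is a sketch under that assumption; match_at is a hypothetical helper, not a built-in method.

```ruby
# Match regex against str only at character position pos (0-indexed),
# mimicking Python's regex.match(string, pos).
def match_at(str, regex, pos)
  # Prefix the source with \G and keep the original flags (i, m, x).
  anchored = Regexp.new("\\G(?:#{regex.source})", regex.options)
  str.match(anchored, pos)
end

p match_at('xay', /a/, 1)   # MatchData for "a"
p match_at('xhay', /a/, 1)  # nil, the character at 1 is 'h'
```

The non-capturing group keeps alternations such as /a|b/ bound to the \G anchor as a whole.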
What is the reason behind different results between the following regexp statements:
"abbcccddddeeee"[/z*/] # => ""
And these that return nil:
"some matching content"[/missing/] # => nil
"start end"[/\Aend/] # => nil
What's happening is that /z*/ matches zero or more occurrences of z, so it succeeds with an empty match.
If you use /z+/, which requires one or more, you'll see it returns nil as expected.
The regular expression /z*/ matches 0 or more z characters, so it also matches an empty string at the beginning of your string. Consider this:
"abbcccddddeeee" =~ /z*/
# => 0
Thus String#[] returns the matched empty string.
In your second example the expressions /missing/ and /\Aend/ don't match anything so nil is returned.
The * quantifier stands for zero or more repetitions, so even if z is not present you get an empty-string match. On the other hand, you can use + for one or more repetitions and ? for zero or one.
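The three quantifiers can be compared side by side with a small sketch. The helper name first_match is hypothetical and simply wraps String#[].

```ruby
# Return the first match of regex in str, or nil when there is none.
def first_match(str, regex)
  str[regex]
end

s = "abbcccddddeeee"
p first_match(s, /z*/)      # ""   zero repetitions is allowed, so an empty match succeeds
p first_match(s, /z+/)      # nil  at least one z is required
p first_match(s, /z?/)      # ""   zero or one, so an empty match succeeds
p first_match("zzz", /z?/)  # "z"
p first_match("zzz", /z*/)  # "zzz"
```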