Why does this regex not match numbers and single letters? - ruby

Why does this regex not match 3a?
(\/\d{1,4}?|\d{1,4}?|\d{1,4}[A-z]{1})
Using \d{1,4}\D{1}, the result is the same.
Street numbers:
/1
78
3a
89/
-1 (special case)
1
https://regex101.com/r/cYCafR/3

The digits+letter combination is not matched due to the order of alternatives in your pattern. The \d{1,4}? matches the digit before the letter, and \d{1,4}[A-z]{1} does not even have a chance to step in. See the Remember That The Regex Engine Is Eager article.
The \/\d{1,4}? will match a / and a single digit after the slash, and \d{1,4}? will always match a single digit, as {min,max}? is a lazy range/interval/limiting quantifier and as such only matches as few chars as possible. See Laziness Instead of Greediness.
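Besides, [A-z] is a typo; it should be [A-Za-z]. The range A-z also covers the ASCII characters that sit between Z and a, namely [, \, ], ^, _ and `.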
It seems you want
\d{1,4}[A-Za-z]|\/?\d{1,4}
See the regex demo. If it should be at the start of a line, use
^(?:\d{1,4}[A-Za-z]|\/?\d{1,4})
See this regex demo.
Details
^ - start of a line
(?: - start of a non-capturing group
\d{1,4}[A-Za-z] - 1 to 4 digits and an ASCII letter
| - or
\/? - an optional /
\d{1,4} - 1 to 4 digits
) - end of the group.
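To see the effect of the alternation order, here is a small Ruby sketch. The constant names ORIGINAL, FIXED and SAMPLES and the helper first_match are only illustrative; the sample values are street numbers from the question.

```ruby
# Pattern from the question: the bare-digits alternative comes before digits+letter.
ORIGINAL = /(\/\d{1,4}?|\d{1,4}?|\d{1,4}[A-z]{1})/
# Reordered pattern from the answer, anchored at the start of a line.
FIXED = /^(?:\d{1,4}[A-Za-z]|\/?\d{1,4})/

SAMPLES = ['/1', '78', '3a', '1'].freeze

# Return the matched text (or nil) for every sample string.
def first_match(pattern, samples)
  samples.map { |s| s[pattern] }
end

p first_match(ORIGINAL, SAMPLES) # ["/1", "7", "3", "1"]
p first_match(FIXED, SAMPLES)    # ["/1", "78", "3a", "1"]
```

With the original pattern, 78 and 3a are cut down to their first digit; with the reordered pattern the longer alternative gets the first try.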

Your regex uses lazy quantifiers like {1,4}?. These match one character and stop, because nothing follows them in their alternative, so the match already succeeds after a single digit. See here for how greedy vs lazy quantifiers work.
Another reason is that you put the \d{1,4}[A-z]{1} case last. This case will only be tried if the first two cases don't match. With 3a, the 3 already matches the second case, so the last case won't be considered.
You seem to just want:
^(\d{1,4}[A-Za-z]|\/?\d{1,4})
Note how the \/\d{1,4} case and the \d{1,4} case in your original regex are combined into one case \/?\d{1,4}.
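The lazy versus greedy behaviour can be checked directly. This is only a sketch; the constant names are made up for the illustration.

```ruby
# Lazy: {1,4}? takes as few digits as possible; nothing follows, so one is enough.
LAZY = /\d{1,4}?/
# Greedy: {1,4} takes as many digits as possible, up to four.
GREEDY = /\d{1,4}/
# Lazy with a letter after it: the quantifier expands until the letter can match.
LAZY_THEN_LETTER = /\d{1,4}?[A-Za-z]/

p '1234a'[LAZY]             # "1"
p '1234a'[GREEDY]           # "1234"
p '1234a'[LAZY_THEN_LETTER] # "1234a"
```

A lazy quantifier only grows when the rest of the pattern forces it to, which is why it collapses to one digit when it is the last thing in an alternative.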

Related

Regex for two letters and up to 13 digits incorrectly accepts additional letters

I am trying to build a regexp for a model field that follows this rule:
starts with two letters
can be filled up with digits, up to 13 digits
Valid examples:
US333
FR52389000
Invalid examples:
11111
T11
I thought I found the right regex:
/[a-zA-Z][a-zA-Z]\d*/
But proof testing it with http://rubular.com/ seems to validate RR444kjj
Can someone point out the mistake?
You need to use a limiting quantifier with \d and correct anchors.
/\A[[:alpha:]]{2}\d{0,13}\z/
See the regex demo.
\A - start of the string (note that the ^ anchor matches the start of the line in Ruby regex)
[[:alpha:]]{2} - 2 letters (to make sure you only allow ASCII letters, use [a-zA-Z]{2})
\d{0,13} - 0 to 13 digits
\z - end of string (note that the $ anchor matches the end of the line in Ruby regex).
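A minimal sketch of how the anchors change the result in Ruby. The constant names and the helper valid_code? are hypothetical; the test strings come from the question, plus one multi-line string added for illustration.

```ruby
# Anchored to the whole string, as in the answer.
STRICT = /\A[[:alpha:]]{2}\d{0,13}\z/
# Unanchored pattern from the question: it is satisfied by a substring.
LOOSE = /[a-zA-Z][a-zA-Z]\d*/
# Line anchors: ^ and $ match at line boundaries inside a multi-line string.
LINE_ANCHORED = /^[[:alpha:]]{2}\d{0,13}$/

def valid_code?(str)
  STRICT.match?(str)
end

p valid_code?('US333')                  # true
p valid_code?('T11')                    # false
p LOOSE.match?('RR444kjj')              # true, only "RR444" is matched
p valid_code?('RR444kjj')               # false
p LINE_ANCHORED.match?("US333\nxx!!")   # true, the first line alone matches
p valid_code?("US333\nxx!!")            # false
```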

Regular Expression(Ruby): match non-word, but exclude spaces

^(?=(.*\d){4,})(?=(.*[A-Z]){3})(?!\s)(?=.*\W{2,})(?=(.*[a-z]){2,}).{12,14}$
The RegExp above is trying to:
match at least 4 digits - (?=(.*\d){4,})
match exactly 3 upper case letters - (?=(.*[A-Z]){3})
don't match spaces - (?!\s)
match at least 2 non-word characters - (?=.*\W{2,})
match at least 2 lower - (?=(.*[a-z]){2,})
string must be between 12 and 14 in length - .{12,14}
But I am having a challenge getting this to avoid matching spaces. It seems like because \W also includes spaces, my preceding negative look-ahead on spaces is being ignored.
For example:
b4A#Ac33*8Pd -- should match
b4A#Ac3 3*8Pd -- should not match
rubular link
Edited to provide further clarification:
Basically, I am trying to avoid having to spell out all the characters in the POSIX [:punct:] class, i.e. !"#$%&'()*+,./:;<=>?#\^_\{|}~-` . That is why I needed to use \W, but I would also want to exclude spaces.
I could use a second pair of eyes, and more experienced suggestions here.
Edited again, to correct mix-ups in counts specified in sub-patterns, as pointed out in the accepted answer below.
Instead of using dot ., use non spaces \S:
^(?=(.*\d){3,})(?=(.*[A-Z]){2})(?=.*\W{1,})(?=(.*[a-z]){1,})\S{12,14}$
(The change is the \S near the end, in place of the dot.)
Also, is this a typo: match at least 4 digits - (?=(.*\d){3,})?
It should be either:
match at least 3 digits - (?=(.*\d){3,})
or
match at least 4 digits - (?=(.*\d){4,})
The same goes for the other counts.
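A short sketch comparing the dot with \S, using the counts from the answer's pattern and the two example strings from the question. The constant names are illustrative only. Note that a (?!\s) placed right after ^ only tests the first character, which is why it did not help in the original pattern.

```ruby
# With the dot: a space counts as "any character" (and also as \W).
WITH_DOT = /^(?=(.*\d){3,})(?=(.*[A-Z]){2})(?=.*\W{1,})(?=(.*[a-z]){1,}).{12,14}$/
# With \S, as in the answer: whitespace is refused anywhere in the 12-14 characters.
WITH_NON_SPACE = /^(?=(.*\d){3,})(?=(.*[A-Z]){2})(?=.*\W{1,})(?=(.*[a-z]){1,})\S{12,14}$/

GOOD = 'b4A#Ac33*8Pd'
BAD  = 'b4A#Ac3 3*8Pd'

p WITH_DOT.match?(GOOD)       # true
p WITH_DOT.match?(BAD)        # true, the space slips through
p WITH_NON_SPACE.match?(GOOD) # true
p WITH_NON_SPACE.match?(BAD)  # false
```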

Understanding negative look aheads in regular expressions

I want to match urls that do NOT contain the string 'localhost' using Ruby regex
Based on answers and comments here, I put together two solutions, both of which seem to work:
Solution A:
(?!.*localhost)^.*$
Example: http://rubular.com/r/tQtbWacl3g
Solution B:
^((?!localhost).)*$
Example: http://rubular.com/r/2KKnQZUMwf
The problem is that I don't understand what they're doing. For example, according to the docs, ^ can be used in various ways:
[^abc] Any single character except: a, b, or c
^ Start of line
But I don't get how it's being applied here.
Can someone breakdown these expressions for me, and how they differ from one another?
In both of your cases, ^ is just the start of the line (since it's not used inside a character class). Since both ^ and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:
^(?!.*localhost).*$
The ^ anchors the expression to the beginning of the string. The lookahead then starts from that position and tries to find localhost anywhere in the string (the "anywhere" is taken care of by the .* in front of localhost). If localhost can be found, the subexpression of the lookahead matches and therefore the negative lookahead causes the pattern to fail. Since the lookahead is bound to start at the beginning of the string by the adjacent ^, the pattern overall cannot match. If, however, .*localhost does not match (and hence localhost does not occur in the string), the lookahead succeeds, and the .*$ simply takes care of matching the rest of the string.
Now the other one
^((?!localhost).)*$
This time the lookahead only checks at the current position (there is no .* inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^ makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost is found at that position. If not, all is well, and . consumes one character. The * then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost - again, if not, all is well, and . consumes another character. This is done for every single character in the string, until we reach the end.
In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.
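Both variants can be compared in Ruby with a small sketch. The constant names, the helper no_localhost? and the bracketed-segment pattern are assumptions made for the illustration; the last pattern is one possible reading of "a fixed part of the string".

```ruby
# Variant 1: a single lookahead at the start scans the whole string.
WHOLE_STRING_CHECK = /^(?!.*localhost).*$/
# Variant 2: the lookahead is repeated before every consumed character.
PER_CHARACTER_CHECK = /^((?!localhost).)*$/
# Variant 2 applied to a part only: a [...] segment that does not contain localhost.
BRACKETED = /\[(?:(?!localhost).)*\]/

def no_localhost?(pattern, url)
  pattern.match?(url)
end

p no_localhost?(WHOLE_STRING_CHECK, 'http://example.com/')         # true
p no_localhost?(WHOLE_STRING_CHECK, 'http://localhost:1234/toto')  # false
p no_localhost?(PER_CHARACTER_CHECK, 'http://example.com/')        # true
p no_localhost?(PER_CHARACTER_CHECK, 'http://localhost:1234/toto') # false

# localhost outside the brackets is ignored; only the segment is checked.
p 'localhost [example.com]'[BRACKETED] # "[example.com]"
p '[localhost]'[BRACKETED]             # nil
```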
You can get them easily explained online. The first:
NODE EXPLANATION
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
localhost 'localhost'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
' '
And the second:
NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
localhost 'localhost'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
As an aside, these two solutions are slow. A better way is to use:
^(?:[^l]+|l(?!ocalhost))+
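In other words: match runs of characters that are not an l, or an l that is not followed by ocalhost. To reject a whole string, the pattern also needs an end anchor ($ or \z); without one it can still match the part of the string before localhost.
This will give you a better result since you don't have to check each position. (For a URL like http://localhost:1234/toto this kind of pattern will fail in ~15 steps vs ~50 steps for the two other patterns.)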
You can improve this pattern using atomic groups and possessive quantifiers to forbid backtracks:
^(?>[^l]++|l(?!ocalhost))++
Note that in your particular case you can speed up your pattern considering that you only want to check the host part of the url. Example:
^http:\/\/(?>[^l\s\/]++|l(?!ocalhost))++(?>\/\S*+|$)
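A sketch of the atomic-group variant in Ruby, which supports (?>...) and the possessive ++. The constant names are made up, and the $ in ANCHORED is an addition for whole-line validation rather than part of the pattern as posted. Step counts are not measured here.

```ruby
# As posted: no end anchor, so a prefix of the string is enough for a match.
UNANCHORED = /^(?>[^l]++|l(?!ocalhost))++/
# With an end anchor (an assumption added here): the whole line must be free of localhost.
ANCHORED = /^(?>[^l]++|l(?!ocalhost))++$/

LOCAL  = 'http://localhost:1234/toto'
REMOTE = 'http://example.com/lol'

p LOCAL[UNANCHORED]        # "http://", the prefix before localhost still matches
p ANCHORED.match?(LOCAL)   # false
p ANCHORED.match?(REMOTE)  # true, each l here is not followed by ocalhost
```

Because the group and its quantifiers are possessive, the engine cannot give characters back once the end anchor fails, so the failure is reported without retrying shorter runs.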
according to the docs, ^ can be used in various ways:
[^abc] Any single character except: a, b, or c
^ Start of line
But I don't get how it's being applied here.
In the regex
(?!.*localhost)^.*$
The ^ is not inside any brackets, so the second one applies. Here is a trivial example:
/^x/
That regex says to match the start of the line, followed by the letter x. So it will match lines like this:
xcellent
x-ray
However, the regex will not match the lines:
axb
excellent
...because the x does not appear directly after the start of the line. You may wonder why 'axb' doesn't match. After all 'a' is the start of the line, and it is followed by an 'x'. However, 'start of the line' is just to the left of the first character, like this:
|
V
axb
^ is called a zero-width match because it matches the slim sliver just to the left of the 'a', e.g. between the starting quote mark and the 'a' in "axb". There's not really any space there, so ^ matches something that is 0 width.
Here is another example:
/x^/
That says to match the character x followed by the start of the line. Well, no line can have an x first and then the start of the line second, so that won't ever match anything.
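These two small patterns can be tried in Ruby. The constant names are illustrative, and the multi-line test string is an added example.

```ruby
START_THEN_X = /^x/
X_THEN_START = /x^/

p START_THEN_X.match?('xcellent') # true
p START_THEN_X.match?('axb')      # false

# ^ on its own matches, but the matched text is empty: it is zero-width.
m = 'axb'.match(/^/)
p m[0]       # ""
p m.begin(0) # 0

# No position can follow an x and be the start of a line at the same time.
p X_THEN_START.match?("ax\nxb")   # false
```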
Now your regex:
(?!.*localhost)^.*$
Like the 'start of line' ^, a lookahead is zero-width. What that means is that the lookahead scans the string looking for the match, but when it finds the match, it comes back to the beginning of the string, and then looks for the rest of the regex:
^.*$
One word of advice: when a regex requires lookarounds (lookaheads or lookbehinds), 99% of the time there are easier ways to do what you want. For instance, you could write:
url = "....."
if url.start_with?('http')
  # then the line starts with 'http'
else
  # the line doesn't start with 'http'
end
That's much easier to read, and it doesn't require trying to decipher a complex regex.
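Applied to the question itself, the same idea might look like the sketch below. The method name non_local_url? is hypothetical; String#include? does a plain substring search.

```ruby
# True when the string does not contain 'localhost' anywhere.
def non_local_url?(url)
  !url.include?('localhost')
end

p non_local_url?('http://example.com/')        # true
p non_local_url?('http://localhost:1234/toto') # false
```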

How to find whole complete number with ruby regex

I'm looking to find the first whole occurrence of a number within a string. I'm not looking for the first digit, rather the whole first number. So, for example, the first number in: w134fklj342 is 134, while the first number in 1235alkj9342klja9034 is 1235.
I have attempted to use \d but I'm unsure how to expand that to include multiple digits (without specifying how long the number is).
I think, you're looking for this regex
\d+
"Plus" means "one or more". This regex will match every number within a string, so pick the first one.
strings = ['w134fklj342', '1235alkj9342klja9034']
strings.each do |s|
  puts s[/\d+/]
end
# >> 134
# >> 1235
Demo: http://rubular.com/r/YE8kPE2SyW
The easiest way to understand regexes is to think of each bit as one character; e.g. \d or [1234567890] or [0-9] will match one digit.
To expand this one character you have 2 basic options: * and +
* will match the character 0 or more times
+ will match it one or more times
Like Sergio said you should use \d+ to match many digits.
Excellent tutorial for regexes in general: http://www.regular-expressions.info/tutorial.html
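The difference between * and + shows up as soon as the string does not start with a digit. A small sketch, with first_number as a hypothetical helper name:

```ruby
# Return the first run of digits as a string, or nil when there is none.
def first_number(str)
  str[/\d+/]
end

p first_number('w134fklj342')          # "134"
p first_number('1235alkj9342klja9034') # "1235"
p first_number('no digits')            # nil

# \d* is satisfied by zero digits, so it matches the empty string at position 0.
p 'w134fklj342'[/\d*/]                 # ""

# scan collects every run of digits, not just the first.
p 'w134fklj342'.scan(/\d+/)            # ["134", "342"]
```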

How can I write a regex in Ruby that will determine if a string meets this criteria?

How can I write a regex in Ruby 1.9.2 that will determine if a string meets this criteria:
Can only include letters, numbers and the - character
Cannot be an empty string, i.e. cannot have a length of 0
Must contain at least one letter
/\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i
It goes like
beginning of string
some (or zero) letters, digits and/or dashes
a letter
some (or zero) letters, digits and/or dashes
end of string
I suppose these two will help you: /\A[a-z0-9\-]{1,}\z/i and /[a-z]{1,}/i. The first one checks the first two rules and the second one checks the last condition.
No regex:
str.count("a-zA-Z") > 0 && str.count("^a-zA-Z0-9-") == 0
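The three approaches so far can be put side by side. This is a sketch only; the method names and the CASES table are made up for the comparison.

```ruby
SINGLE_REGEX = /\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i

# One regex: allowed characters on both sides of a mandatory letter.
def valid_single?(str)
  SINGLE_REGEX.match?(str)
end

# Two regexes: allowed characters only and non-empty, then at least one letter.
def valid_two_step?(str)
  /\A[a-z0-9\-]{1,}\z/i.match?(str) && /[a-z]{1,}/i.match?(str)
end

# No regex: String#count with character sets ("^" negates the set).
def valid_count?(str)
  str.count('a-zA-Z') > 0 && str.count('^a-zA-Z0-9-') == 0
end

# Input string => expected verdict under the three rules from the question.
CASES = {
  'abc-123' => true, 'a' => true, '123-4a' => true,
  '123' => false, '' => false, '-' => false, 'ab_c' => false, "abc\n" => false
}.freeze

CASES.each do |str, expected|
  results = [valid_single?(str), valid_two_step?(str), valid_count?(str)]
  puts "#{str.inspect}: #{results.inspect} (expected #{expected})"
end
```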
You can take a look at this tutorial for how to use regular expressions in Ruby. With regard to what you need, you can use the following:
^[A-Za-z0-9\-]+$
The ^ will instruct the regex engine to start matching at the beginning of a line (in Ruby, use \A for the beginning of the string).
The [..] will instruct the regex engine to match any one of the characters they contain.
A-Z means any uppercase letter, a-z means any lowercase letter and 0-9 means any digit.
The \- will instruct the regex engine to match the -. The \ is used in front of it because - denotes a range inside a character class, so it needs to be escaped (or placed first or last in the class).
The $ will instruct the regex engine to stop matching at the end of the line (in Ruby, use \z for the end of the string).
The + instructs the regex engine to match what is contained between the square brackets one or more times.
You can also use the i flag to make your search case insensitive, so the regex might become something like this:
/^[a-z0-9\-]+$/i
Note that this pattern on its own does not enforce the "at least one letter" rule.

Resources