Difference between \A \z and ^ $ in Ruby regular expressions - ruby

In the documentation I read:
Use \A and \z to match the start and end of the string, ^ and $ match the start/end of a line.
I am going to apply a regular expression to check username (or e-mail is the same) submitted by user. Which expression should I use with validates_format_of in model? I can't understand the difference: I've always used ^ and $ ...

If you're depending on the regular expression for validation, you always want to use \A and \z. ^ and $ only anchor to a single line, which means someone could submit an email like me@example.com\n<script>dangerous_stuff();</script> and still have it validate, since the regex is satisfied by the line before the \n.
My recommendation would just be completely stripping new lines from a username or email beforehand, since there's pretty much no legitimate reason for one. Then you can safely use EITHER \A \z or ^ $.
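To make the newline problem concrete, here is a minimal sketch. The pattern is only an illustrative stand-in (an assumption, not a complete e-mail validator), and the constant names are made up for this example.

```ruby
# Hypothetical input: a plausible address followed by an injected second line.
INJECTED = "me@example.com\n<script>dangerous_stuff();</script>"

# Illustrative patterns only; they differ in nothing but their anchors.
LINE_ANCHORED   = /^[^@\s]+@[^@\s]+$/
STRING_ANCHORED = /\A[^@\s]+@[^@\s]+\z/

# ^ and $ are satisfied by the first line alone, so the payload slips through.
p LINE_ANCHORED.match?(INJECTED)            # => true

# \A and \z must span the whole string, so the same input is rejected.
p STRING_ANCHORED.match?(INJECTED)          # => false
p STRING_ANCHORED.match?("me@example.com")  # => true
```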

According to Pickaxe:
^
Matches the beginning of a line.
$
Matches the end of a line.
\A
Matches the beginning of the string.
\z
Matches the end of the string.
\Z
Matches the end of the string unless the string ends with a "\n", in which case it matches just before the "\n".
So, use \A and lowercase \z. If you use \Z someone could sneak in a newline character. This is not dangerous I think, but might screw up algorithms that assume that there's no whitespace in the string. Depending on your regex and string-length constraints someone could use an invisible name with just a newline character.
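A small sketch of the \z versus \Z difference, assuming a deliberately permissive \w* pattern so that the "invisible name" case shows up; the constant names are hypothetical.

```ruby
# Two patterns that differ only in the end-of-string anchor.
LOOSE_END  = /\A\w*\Z/  # \Z tolerates one trailing "\n"
STRICT_END = /\A\w*\z/  # \z does not

p LOOSE_END.match?("andrew\n")   # => true  (newline sneaks in)
p STRICT_END.match?("andrew\n")  # => false

# A "name" consisting of nothing but a newline.
p LOOSE_END.match?("\n")         # => true
p STRICT_END.match?("\n")        # => false
```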
JavaScript's regex implementation treats \A as a literal 'A'. So watch yourself out there and test.

Difference By Example
/^foo$/ matches any of the following three strings, /\Afoo\z/ does not:
whatever1
foo
whatever2

foo
whatever2

whatever1
foo

/^foo$/ and /\Afoo\z/ both match the following:
foo

The start and end of a string may not necessarily be the same thing as the start and end of a line. Imagine if you used the following as your test string:
my
name
is
Andrew
Notice that the string has many lines in it - the ^ and $ characters allow you to match the beginning and end of those lines (basically treating the \n character as a delimiter) while \A and \Z allow you to match the beginning and end of the entire string.
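The same test string can be tried in irb. This is just a sketch; NAME_TEXT is a name introduced here for illustration.

```ruby
NAME_TEXT = "my\nname\nis\nAndrew"

# ^ and $ anchor at every line boundary, so each line matches separately.
p NAME_TEXT.scan(/^\w+$/)    # => ["my", "name", "is", "Andrew"]

# \A and \z anchor only at the two ends of the whole string.
p NAME_TEXT.scan(/\A\w+\z/)  # => []
p NAME_TEXT.scan(/\A\w+/)    # => ["my"]
p NAME_TEXT.scan(/\w+\z/)    # => ["Andrew"]
```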

Related

Why am I not able to match multiple lines with this regex on rubular?

I'm working with the following regex (taken from the devise.rb file that devise generates):
\A[^#\s]+#[^#\s]+\z
Usually, when I'm learning about a regex I use rubular. For example, if I wanted to learn about the regex /.a./, I would set up my workspace as shown here:
Notice how I'm using multiple examples:
foo
bar
baz
And rubular is giving me feedback that both bar and baz match.
Now I'd like to learn about the regex that devise generates: /\A[^#\s]+#[^#\s]+\z/. So I set up my rubular workspace as shown here:
There isn't a match. It's because I have two examples:
foo#foo.com
cats#cat.com
But I was expecting them both to match. Why aren't both test strings matching?
This is because the regex /\A[^#\s]+#[^#\s]+\z/ is matching the start of the string with \A and end of the string with \z.
If you remove both \A and \z and instead try to match /[^#\s]+#[^#\s]+/ then it will match both email addresses as shown here:
Also, it's worth mentioning that the start and end of a string are different from the start and end of a line. They are represented by the four patterns below, which also appear in rubular's Regex quick reference:
^ - Start of line
$ - End of line
\A - Start of string
\z - End of string
There can be multiple lines in a string; however, a single string goes from \A to \z. To continue with the multiple-email example: replacing the start- and end-of-string patterns with the start- and end-of-line patterns gives /^[^#\s]+#[^#\s]+$/, which will also match both addresses, as can be seen on rubular.
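Outside rubular, the behaviour can be reproduced in plain Ruby. The sketch below uses the pattern and the two test addresses exactly as they are written in this thread; the constant names are invented for the example. Rubular treats the whole test box as one string, which is what RUBULAR_INPUT imitates.

```ruby
# Both test addresses as a single string, the way rubular sees the test box.
RUBULAR_INPUT = "foo#foo.com\ncats#cat.com"

WHOLE_STRING = /\A[^#\s]+#[^#\s]+\z/
PER_LINE     = /^[^#\s]+#[^#\s]+$/

# The two-line string as a whole is not a single address.
p WHOLE_STRING.match?(RUBULAR_INPUT)  # => false

# Line anchors find one match per line.
p RUBULAR_INPUT.scan(PER_LINE)        # => ["foo#foo.com", "cats#cat.com"]

# Validating each line as its own string works with \A and \z.
p RUBULAR_INPUT.lines.map(&:chomp).map { |line| WHOLE_STRING.match?(line) }
# => [true, true]
```

For validating user input, the last form is the one that mirrors real usage: each submitted value is its own string.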

Ruby Regexp character class with new line, why not match?

I want to use this regex to match any block comment (c-style) in a string.
But why does the code below not match?
rblockcmt = Regexp.new "/\\*[.\s]*?\\*/" # match block comment
p rblockcmt=~"/* 22/Nov - add fee update */"
==> nil
And in addition to what Sir Swoveland posted, a . matches any character except a newline:
The following metacharacters also behave like character classes:
/./ - Any character except a newline.
https://ruby-doc.org/core-2.3.0/Regexp.html
If you need . to match a newline, you can specify the m flag, e.g. /.*?/m
Options
The end delimiter for a regexp can be followed by one or more
single-letter options which control how the pattern can match.
/pat/i - Ignore case
/pat/m - Treat a newline as a character matched by .
...
https://ruby-doc.org/core-2.3.0/Regexp.html
Because having exceptions/quirks like newline not matching a . can be painful, some people specify the m option for every regex they write.
It appears that you intend [.\s]*? to match any character or a whitespace, zero or more times, lazily. Firstly, whitespaces are characters, so you don't need \s. That simplifies your expression to [.]*?. Secondly, if your intent is to match any character there is no need for a character class, just write .. Thirdly, and most importantly, a period within a character class is simply the character ".".
You want .*? (or [^*]*).
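Putting the points above together, a possible fix looks like the sketch below. The constant names and the second sample comment are assumptions made for illustration. Note one extra pitfall in the original code: inside a double-quoted Ruby string, "\s" is a plain space, so the class there only ever contained a dot and a space.

```ruby
COMMENT           = "/* 22/Nov - add fee update */"
MULTILINE_COMMENT = "/* line one\n   line two */"

# The original pattern: [.\s] built from a double-quoted string matches
# only a literal dot or a space, so it cannot cross "22/Nov".
BROKEN = Regexp.new("/\\*[.\s]*?\\*/")

# Lazy "any character" between the delimiters; %r{} avoids escaping "/".
SINGLE_LINE = %r{/\*.*?\*/}
# The m flag lets . match newlines, so comments may span lines.
ANY_COMMENT = %r{/\*.*?\*/}m

p BROKEN =~ COMMENT                       # => nil
p SINGLE_LINE =~ COMMENT                  # => 0
p SINGLE_LINE =~ MULTILINE_COMMENT        # => nil
p ANY_COMMENT.match(MULTILINE_COMMENT)[0] == MULTILINE_COMMENT  # => true
```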

Regex to match "AAAA:AAA" pattern

A string must begin with 3 or 4 letters (not numbers), and a ":" symbol should follow these letters, and after the colon there should be three more characters, like AAA. For example, AAAA:AAA or AAA:AAA.
I'm starting to build this, but regex is so painful for me. Can anyone help me with this?
Here is what I have now:
^[a-zA-Z]{3,4}(:)$
Your regex is almost there: you need to add [a-zA-Z]{3}.
I prefer the [[:alpha:]] POSIX class in Ruby to match letters though.
/[[:alpha:]]/ - Alphabetic character
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters.
So, here is a possible regex:
\A[[:alpha:]]{3,4}:[[:alpha:]]{3}\z
See demo
The regex matches:
\A - start of string (in RoR, you have to use \A instead of ^, or you will get errors)
[[:alpha:]]{3,4} - 3 or 4 letters
: - literal :
[[:alpha:]]{3} - 3 letters
\z - end of string (in RoR, you have to use \z instead of $, or you will get errors)
To allow just AAA or AAAA, you need to introduce an optional (? quantifier) non-capturing group ((?:...) construction):
\A[[:alpha:]]{3,4}(?::[[:alpha:]]{3})?\z
                  ^^^               ^^
See another demo
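A quick way to check the pattern in irb is sketched below. The sample inputs and constant names are made up for this example; LINE_BASED is included only to show what the \A and \z anchors buy you.

```ruby
STRICT     = /\A[[:alpha:]]{3,4}:[[:alpha:]]{3}\z/
LINE_BASED = /^[a-zA-Z]{3,4}:[a-zA-Z]{3}$/

p STRICT.match?("AAAA:AAA")  # => true
p STRICT.match?("AAA:AAA")   # => true
p STRICT.match?("AA:AAA")    # => false (too few letters before the colon)
p STRICT.match?("A1AA:AAA")  # => false (digit in the prefix)

# With line anchors, extra lines after a valid first line still pass.
p LINE_BASED.match?("AAAA:AAA\nextra")  # => true
p STRICT.match?("AAAA:AAA\nextra")      # => false
```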
Try using this (quotes if regex in your dialect must be passed as a string)
"^[a-zA-Z]{3,4}:[a-zA-Z]{3}$"

How to allow string with letters, numbers, period, hyphen, and underscore?

I am trying to make a regular expression that allows strings made of lowercase and uppercase letters plus numbers (a-zA-Z0-9) and also the characters .-_
How do I make such a regex?
The following regex should be what you are looking for (explanation below):
\A[-\w.]*\z
The following character class should match only the characters that you want to allow:
[-a-zA-Z0-9_.]
You could shorten this to the following since \w is equivalent to [a-zA-Z0-9_]:
[-\w.]
Note that to include a literal - in your character class, it needs to be the first (or last) character because otherwise it will be interpreted as a range (for example [a-d] is equivalent to [abcd]). The other option is to escape it with a backslash.
Normally . means any character except newlines, and you would need to escape it to match a literal period, but this isn't necessary inside of character classes.
The \A and \z are anchors to the beginning and end of the string, otherwise you would match strings that contain any of the allowed characters, instead of strings that contain only the allowed characters.
The * means zero or more characters, if you want it to require one or more characters change the * to a +.
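Here is a short sketch of that pattern in use, with + instead of * so the empty string is rejected (an assumption about what most callers want). The constant name and sample inputs are hypothetical. The last two lines show why the range must be written A-Z: A-z also covers the punctuation that sits between "Z" and "a" in ASCII.

```ruby
ALLOWED = /\A[-\w.]+\z/

p ALLOWED.match?("user.name-01_x")  # => true
p ALLOWED.match?("bad name")        # => false (space)
p ALLOWED.match?("semi;colon")      # => false (semicolon)
p ALLOWED.match?("")                # => false (+ requires one character)

# Range pitfall: A-z spans "[", "\", "]", "^", "_" and "`" as well.
p "^".match?(/\A[A-z]\z/)  # => true
p "^".match?(/\A[A-Z]\z/)  # => false
```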
/\A[\w\-\.]+\z/
\w means alphanumeric (case-insensitive) and "_"
\- means dash
\. means period
\A means beginning (even "stronger" than ^)
\z means end (even "stronger" than $)
for example:
>> 'a-zA-z0-9._' =~ /\A[\w\-\.]+\z/
=> 0 # this means a match
UPDATED thanks phrogz for improvement

Ruby RegEx problem text.gsub[^\W-], '') fails

I'm trying to learn RegEx in Ruby, based on what I'm reading in "The Rails Way". But, even this simple example has me stumped. I can't tell if it is a typo or not:
text.gsub(/\s/, "-").gsub([^\W-], '').downcase
It seems to me that this would replace all spaces with -, then anywhere a string starts with a non letter or number followed by a dash, replace that with ''. But, using irb, it fails first on ^:
syntax error, unexpected '^', expecting ']'
If I take out the ^, it fails again on the W.
>> text = "I love spaces"
=> "I love spaces"
>> text.gsub(/\s/, "-").gsub(/[^\W-]/, '').downcase
=> "--"
Missing //
Although this makes a little more sense :-)
>> text.gsub(/\s/, "-").gsub(/([^\W-])/, '\1').downcase
=> "i-love-spaces"
And this is probably what is meant
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
\W means "not a word character"
\w means "a word character"
The // generate a regexp object
/[^\W-]/.class
=> Regexp
Step 1: Add this to your bookmarks. Whenever I need to look up regexes, it's my first stop
Step 2: Let's walk through your code
text.gsub(/\s/, "-")
You're calling the gsub function, and giving it 2 parameters.
The first parameter is /\s/, which is ruby for "create a new regexp containing \s" (the // are like special "" for regexes).
The second parameter is the string "-".
This will therefore replace all whitespace characters with hyphens. So far, so good.
.gsub([^\W-], '').downcase
Next you call gsub again, passing it 2 parameters.
The first parameter is [^\W-]. Because we didn't quote it in forward-slashes, ruby will literally try to run that code. [] creates an array, then it tries to put ^\W- into the array, which is not valid code, so it breaks.
Changing it to /[^\W-]/ gives us a valid regex.
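Looking at the regex, the [] says 'match any character in this group'. The group contains \W (which means non-word character) and -, but the leading ^ negates the group, so the regex matches any character that is neither a non-word character nor a hyphen. In other words, it matches word characters.
As the second thing you pass to gsub is an empty string, it ends up replacing all the word characters with an empty string (thereby stripping out the letters and digits and leaving only the hyphens, which is almost certainly not what the book intended).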
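The effect of the caret and of the \W versus \w choice is easy to see with scan. This is a sketch on a made-up sample string; SAMPLE is a name introduced here.

```ruby
SAMPLE = "I-love spaces!"

# Negated class around \W and -: what is left is the word characters.
p SAMPLE.scan(/[^\W-]/).join  # => "Ilovespaces"

# Negated class around \w and -: everything except word characters and hyphens.
p SAMPLE.scan(/[^\w-]/).join  # => " !"

# Without the caret: non-word characters (the hyphen is already one of them).
p SAMPLE.scan(/[\W-]/).join   # => "- !"
```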
.downcase
Which just converts the string to lower case.
Hope this helps :-)
You forgot the slashes. It should be /[^\W-]/
Well, .gsub(/[^\W-]/,'') says replace anything that is neither a non-word character nor a - with nothing, which strips out every word character.
You probably want
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
Lower case \w (\W is just the opposite)
The slashes are to say that the thing between them is a regular expression, much like quotes say the thing between them is a string.
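Wrapped up as a method, the corrected chain might look like the sketch below. The method name slugify and the second sample input are assumptions for illustration; the book's snippet only shows the chained calls.

```ruby
# Turn whitespace into hyphens, drop everything that is not a word
# character or a hyphen, then lower-case the result.
def slugify(text)
  text.gsub(/\s/, "-").gsub(/[^\w-]/, "").downcase
end

p slugify("I love spaces")  # => "i-love-spaces"
p slugify("Hello, World!")  # => "hello-world"
```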

Resources