How exactly does this work string.split(/\?|\.|!/).size? - ruby

I know, or at least I think I know, what this does (string.split(/\?|\.|!/).size); splits the string at every ending punctuation into an array and then gets the size of the array.
The part I am confused with is (/\?|\.|!/).
Thank you for your explanation.

Regular expressions are surrounded by slashes / /
The backslash before the question mark and dot means use those characters literally (don't interpret them as special instructions)
The vertical pipes are "or"
So you have / then question mark \? then "or" | then period \. then "or" | then exclamation point ! then / to end the expression.
/\?|\.|!/

It's a Regular Expression. That particular one matches any '?', '.' or '!' in the target string.
You can learn more about them here: http://regexr.com/

A regular expression splitting on the char "a" would look like this: /a/. A regular expression splitting on "a" or "b" is like this: /a|b/. So splitting on "?", "!" and "." would look like /?|!|./ - but it does not. Unfortunately, "?", and "." have special meaning in regexps which we do not want in this case, so they must be escaped, using "\".
A way to avoid this is to use Regexp.union("?","!",".") which results in /\?|!|\./
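As a quick check of the point above, here is a minimal sketch (the sample sentence and variable names are invented for illustration) showing that Regexp.union builds an equivalent pattern, and that a character class is another spelling that needs no escapes:

```ruby
# Sample text is invented for this illustration.
text = "How are you? Fine. Great!"

by_hand  = /\?|\.|!/
by_union = Regexp.union("?", "!", ".")  # escapes the special characters for us
by_class = /[?.!]/                       # inside [] these characters are literal

pieces = text.split(by_hand)
# pieces => ["How are you", " Fine", " Great"]

# All three patterns split the text the same way.
same_result = [by_union, by_class].all? { |re| text.split(re) == pieces }
sentence_count = pieces.size
```

Note that split drops the trailing empty string after the final "!", so the size is the number of sentences.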

(/\?|\.|!/)
Working outside in:
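The parentheses () here are not part of the regular expression; they belong to the split(...) method call. (Inside a regular expression, parentheses would create a capture group.)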
The // tell Ruby you're using a Regular Expression.
\? Matches any ?
\. Matches any .
! Matches any !
The preceding \ tells Ruby we want to find these specific characters in the string, rather than using them as special characters.
Special characters (that need to be escaped to be matched) are:
. | ( ) [ ] { } + \ ^ $ * ?.
There is a nice guide to Ruby RegEx at:
http://rubular.com/ & http://www.tutorialspoint.com/ruby/ruby_regular_expressions.htm
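If you would rather not remember which characters need a backslash, Regexp.escape can add them for you. A small sketch (the sample string is made up):

```ruby
# Regexp.escape adds backslashes only where the regex engine needs them.
escaped = ["?", ".", "!"].map { |ch| Regexp.escape(ch) }
# escaped => ["\\?", "\\.", "!"]

# Build the same splitter from the escaped pieces.
splitter = Regexp.new(escaped.join("|"))
count = "Really? Yes. Good!".split(splitter).size
```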

For SO answers that involve regular expressions, I often use the "extended" mode, which makes them self-documenting. This one would be:
r = /
\? # match a question mark
| # or
\. # match a period
| # or
! # match an exclamation mark
/x # extended mode
str = "Out, damn'd spot! out, I say!—One; two: why, then 'tis time to " +
"do't.—Hell is murky.—Fie, my lord, fie, a soldier, and afeard?"
str.split(r)
#=> ["Out, damn'd spot",
# " out, I say",
# "—One; two: why, then 'tis time to do't",
# "—Hell is murky",
# "—Fie, my lord, fie, a soldier, and afeard"]
str.split(r).size #=> 5
@steenslag mentioned Regexp::union. You could also use Regexp::new to write (with single quotes):
r = Regexp.new('\?|\.|!')
#=> /\?|\.|!/
but it really doesn't buy you anything here. You might find it useful in other situations, however.

Related

Why do I get the Regexp warning "warning: nested repeat operator ? and * was replaced with '*'"

I have a regular expression for parsing Norwegian street addresses:
STREET_ADDRESS_PATTERN = <<-REGEX
^
(?<street_name>[\w\D\. ]+)\s+
(?<house_number>\d+)
(?<entrance>[A-Z])?\s*,\s*
(
(?<postal_code>\d{4})\s+
(?<city>[\w\D ]+)
)?
$
REGEX
It worked earlier, and I can't remember if I changed something, and in which case what I changed. In any case, now I'm getting this warning:
warning: nested repeat operator ? and * was replaced with '*'
And the match is returning nil. Can anybody see why I'm getting this warning?
Note: I'm currently using this (fake) address to test the expression: "Storgata 38H, 0273 Oslo".
Let's take a look at something you're doing to the poor regular expression engine:
(?<street_name>[\w\D\. ]+)\s+
The problem is inside the character class: [\w\D\. ]+. The following definitions are from Ruby's Regexp class documentation:
/\w/ - A word character ([a-zA-Z0-9_])
/\D/ - A non-digit character ([^0-9])
You're telling the engine to select:
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
_
every character that is NOT 0123456789
. and spaces
In other words, every possible character. You'd do just as well to use:
(?<street_name>.+)
because that's going to be pretty greedy. This Rubular example shows your pattern is allowing the engine to capture everything thrown at it, including almost the entire string Storgata 38H, 0273 Oslo: http://rubular.com/r/nMfcB0cUdu
Also, \. inside [] is the same as [.] because the special use of period as a wildcard is escaped automatically inside the brackets. You don't need to escape it again to try to make it literal because it already is literal.
I'd strongly recommend using Rubular to take a look at each section of your regex, and try matching against several other possible addresses strings, and see if Rubular says the patterns are going to match what you expect. Once you've done that, try putting together the complete pattern. As is, I think your subsections are interacting and masking some problems that will come back to bite you later.
My hope was that [\w\D] would select all word characters except numbers... Any way to do that?
Ah. Let's dive into the documentation again:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [:graph:], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
You want to use the /[[:alpha:]]/ pattern. As displayed it would capture only one character, but it'd be within any of the POSIX set of "letter" characters, which is the range you want:
[4] (pry) main: 0> 'æ, ø and å'.scan(/[[:alpha:]]/)
[
[0] "æ",
[1] "ø",
[2] "a",
[3] "n",
[4] "d",
[5] "å"
]
Here's a wee tweak:
[5] (pry) main: 0> 'æ, ø and å'.scan(/[[:alpha:]]+/)
[
[0] "æ",
[1] "ø",
[2] "and",
[3] "å"
]
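Putting that together, one possible rewrite of the address pattern could look like the sketch below. This is only an illustration, not the asker's final code: the variable name is invented, the character classes are my guess at what was intended, and spaces inside the classes are written as "\ " so they stay literal in extended mode.

```ruby
# Hypothetical rewrite: [[:alpha:]] stands for "letters only", so digits can
# no longer be swallowed by the street name.
address_re = /
  \A
  (?<street_name>[[:alpha:].\ ]+)\s+   # letters, dots and spaces only
  (?<house_number>\d+)
  (?<entrance>[A-Z])?\s*,\s*
  (?:
    (?<postal_code>\d{4})\s+
    (?<city>[[:alpha:]\ ]+)
  )?
  \z
/x

m = address_re.match("Storgata 38H, 0273 Oslo")
parts = m.named_captures
# parts => {"street_name"=>"Storgata", "house_number"=>"38", "entrance"=>"H",
#           "postal_code"=>"0273", "city"=>"Oslo"}
```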
Oh, now I see what I did. I replaced the ' delimiters of the string with <<-REGEX which means that all backslashes in the expression must now be escaped. Changing back to single ticks fixed the issue. After sepp2k's recommendation I further edited the Regex string into a literal:
STREET_ADDRESS_PATTERN = /
^
(?<street_name>[\w\D\. ]+)\s+
(?<house_number>\d+)
(?<entrance>[A-Z])?\s*,\s*
(
(?<postal_code>\d{4})\s+
(?<city>[\w\D ]+)
)?
$
/xi
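The heredoc problem described above is easy to reproduce in isolation. In this minimal sketch (the test string and variable names are arbitrary), an unquoted heredoc processes backslashes like a double-quoted string, while a heredoc with a quoted tag keeps them:

```ruby
# An unquoted heredoc behaves like a double-quoted string: unknown escapes
# such as \d lose their backslash before Regexp.new ever sees them.
cooked = <<-REGEX
\d+
REGEX

# Quoting the heredoc tag keeps the text raw, like a single-quoted string.
raw = <<-'REGEX'
\d+
REGEX

cooked_match = Regexp.new(cooked.strip).match("Storgata 38")  # pattern is /d+/
raw_match    = Regexp.new(raw.strip).match("Storgata 38")     # pattern is /\d+/
```

Here cooked_match is nil because the pattern looks for the letter "d", while raw_match finds "38".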

Ruby Regex Match Between "foo" and "bar"

I have unfortunately wandered into a situation where I need regex using Ruby. Basically I want to match this string after the underscore and before the first parentheses. So the end result would be 'table salt'.
_____ table salt (1) [F]
As usual I tried to fight this battle on my own and with rubular.com. I got the first part
^_____ (Match the beginning of the string with underscores ).
Then I got bolder,
^_____(.*?) ( Do the first part of the match, then give me any amount of words and letters after it )
Regex had had enough and put an end to that nonsense and crapped out. So I was wondering if anyone on stackoverflow knew or would have any hints on how to say my goal to the Ruby Regex parser.
EDIT: Thanks everyone, this is the pattern I ended up using after creating it with rubular.
ingredientNameRegex = /^_+([^(]*)/;
Everything got better once I took a deep breath, and thought about what I was trying to say.
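For reference, this is roughly how that pattern can be applied. The capture keeps the spaces around the name, so a strip is needed; the second pattern is a hypothetical variant that leaves the surrounding whitespace outside the group.

```ruby
ingredient_name_regex = /^_+([^(]*)/   # the pattern from the edit above
line = "_____ table salt (1) [F]"

raw_name = line[ingredient_name_regex, 1]   # => " table salt "
name = raw_name.strip                       # => "table salt"

# Hypothetical variant: keep the whitespace out of the capture group.
trimmed_regex = /^_+\s*([^(]*?)\s*\(/
name_without_strip = line[trimmed_regex, 1] # => "table salt"
```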
str = "_____ table salt (1) [F]"
p str[ /_{3}\s(.+?)\s+\(/, 1 ]
#=> "table salt"
That says:
Find three underscores (_{3} means exactly three; any extra leading underscores are simply skipped)
and a whitespace character (\s)
and then one or more (+) of any character (.), but as little as possible (?), up until you find
one or more whitespace characters,
and then a literal (
The parens in the middle save that bit, and the 1 pulls it out.
Try this: ^[_]+([^(]*)\(
It will match lines starting with one or more underscores followed by anything not equal to an opening bracket: http://rubular.com/r/vthpGpVr4y
Here's working regex:
str = "_____ table salt (1) [F]"
match = str.match(/_([^_]+?)\(/)
p match[1].strip # => "table salt"
You could use
^_____\s*([^(]+?)\s*\(
^_____ match the underscore from the beginning of string
\s* matches any whitespace character
( grouping start
[^(]+ matches all non ( character at least once
? matches the shortest possible string (non greedy)
) grouping end
\s* matches any whitespace character
\( find the (
"_____ table salt (1) [F]".gsub(/[_]\s(.+)\s\(/, ' >>>\1<<< ')
# => "____ >>>table salt<<< 1) [F]"
It seems to me the simplest regex to do what you want is:
/^_____ ([\w\s]+) /
That says:
leading underscores, space, then capture any combination of word chars or spaces, then another space.

Ruby regular expression

Apparently I still don't understand exactly how it works ...
Here is my problem: I'm trying to match numbers in strings such as:
910 -6.258000 6.290
That string should give me an array like this:
[910, -6.258000, 6.290]
while the string
blabla9999 some more text 1.1
should not be matched.
The regex I'm trying to use is
/([-]?\d+[.]?\d+)/
but it doesn't do exactly that. Could someone help me ?
It would be great if the answer could clarify the use of the parenthesis in the matching.
Here's a pattern that works:
/^[^\d]+?\d+[^\d]+?\d+[\.]?\d+$/
Note that [^\d]+ means at least one non digit character.
On second thought, here's a more generic solution that doesn't need one complicated regular expression:
str.gsub(/[^\d.-]+/, " ").split.collect{|d| d.to_f}
Example:
str = "blabla9999 some more text -1.1"
Parsed:
[9999.0, -1.1]
The two kinds of brackets have different meanings.
[] defines a character class, that means one character is matched that is part of this class
() is defining a capturing group, the string that is matched by this part in brackets is put into a variable.
You did not define any anchors so your pattern will match your second string
blabla9999 some more text 1.1
      ^^^^ here           ^^^ and here
Maybe this is more what you wanted
^(\s*-?\d+(?:\.\d+)?\s*)+$
See it here on Regexr
^ anchors the pattern to the start of the string and $ to the end.
it allows Whitespace \s before and after the number and an optional fraction part (?:\.\d+)? This kind of pattern will be matched at least once.
maybe /(-?\d+(\.\d+)?)+/
irb(main):010:0> "910 -6.258000 6.290".scan(/(\-?\d+(\.\d+)?)+/).map{|x| x[0]}
=> ["910", "-6.258000", "6.290"]
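The map{|x| x[0]} step is only needed because the pattern contains capture groups: scan then returns one array of groups per match. A small sketch of the difference, using a non-capturing group (?: ) instead:

```ruby
str = "910 -6.258000 6.290"

# With capture groups, scan yields an array of group values per match.
with_groups = str.scan(/(-?\d+(\.\d+)?)/)
# with_groups => [["910", nil], ["-6.258000", ".258000"], ["6.290", ".290"]]

# With only non-capturing groups, scan yields the whole matches directly.
without_groups = str.scan(/-?\d+(?:\.\d+)?/)
# without_groups => ["910", "-6.258000", "6.290"]
```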
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map(&:to_f)
# => [910.0, -6.258, 6.29]
If you don't want integers to be converted to floats, try this:
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map do |ns|
ns[/\./] ? ns.to_f : ns.to_i
end
# => [910, -6.258, 6.29]
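The scan-based answers above also extract numbers from the "blabla9999" string, which the question wanted rejected. One way to combine the anchored check with scan is sketched below; the names number, whole_line and parse_numbers are made up for this example.

```ruby
# One number: optional minus sign, digits, optional fraction.
number = /-?\d+(?:\.\d+)?/
# A whole line that consists of nothing but whitespace-separated numbers.
whole_line = /\A\s*#{number}(?:\s+#{number})*\s*\z/

# Returns an array of numbers, or nil if the line contains anything else.
parse_numbers = lambda do |line|
  return nil unless whole_line.match?(line)
  line.scan(number).map { |n| n.include?(".") ? n.to_f : n.to_i }
end

good = parse_numbers.call("910 -6.258000 6.290")           # => [910, -6.258, 6.29]
bad  = parse_numbers.call("blabla9999 some more text 1.1") # => nil
```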

How to remove the first 4 characters from a string if it matches a pattern in Ruby

I have the following string:
"h3. My Title Goes Here"
I basically want to remove the first four characters from the string so that I just get back:
"My Title Goes Here".
The thing is I am iterating over an array of strings and not all have the h3. part in front so I can't just ditch the first four characters blindly.
I checked the docs and the closest thing I could find was chomp, but that only works for the end of a string.
Right now I am doing this:
"h3. My Title Goes Here".reverse.chomp(" .3h").reverse
This gives me my desired output, but there has to be a better way. I don't want to reverse a string twice for no reason. Is there another method that will work?
To alter the original string, use sub!, e.g.:
my_strings = [ "h3. My Title Goes Here", "No h3. at the start of this line" ]
my_strings.each { |s| s.sub!(/^h3\. /, '') }
To not alter the original and only return the result, remove the exclamation point, i.e. use sub. In the general case you may have regular expressions that you can and want to match more than once; in that case use gsub! and gsub. Without the g only the first match is replaced, which is what you want here. Note that in Ruby ^ matches at the start of every line, not only at the start of the string, so use \A instead if your strings may contain newlines.
You can use sub with a regular expression:
s = 'h3. foo'
s.sub!(/^h[0-9]+\. /, '')
puts s
Output:
foo
The regular expression should be understood as follows:
^ Match from the start of a line (for a single-line string, that is the start of the string; \A always means the start of the string).
h A literal "h".
[0-9] A digit from 0-9.
+ One or more of the previous (i.e. one or more digits)
\. A literal period.
A space (yes, spaces are significant by default in regular expressions!)
You can modify the regular expression to suit your needs. See a regular expression tutorial or syntax guide, for example here.
A standard approach would be to use regular expressions:
"h3. My Title Goes Here".gsub /^h3\. /, '' #=> "My Title Goes Here"
gsub means globally substitute and it replaces a pattern by a string, in this case an empty string.
The regular expression is enclosed in / and consists of:
^ means the beginning of a line (for a single-line string, the beginning of the string)
h3 is matched literally, so it means h3
\. - a dot normally means any character so we escape it with a backslash
a trailing space, which is matched literally
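To compare the anchors mentioned in these answers, here is a short sketch. The multi-line string is invented for illustration, and delete_prefix assumes Ruby 2.5 or newer.

```ruby
titles = ["h3. My Title Goes Here", "No h3. at the start of this line"]

# \A anchors at the very start of the string; \d+ accepts h1, h2, h3, ...
cleaned = titles.map { |t| t.sub(/\Ah\d+\. /, "") }
# cleaned => ["My Title Goes Here", "No h3. at the start of this line"]

# ^ also matches after a newline, which may or may not be what you want.
multi = "intro\nh3. Title"
with_caret    = multi.sub(/^h3\. /, "")   # => "intro\nTitle"
with_anchor_a = multi.sub(/\Ah3\. /, "")  # => "intro\nh3. Title" (unchanged)

# For one fixed prefix, no regular expression is needed (Ruby 2.5+).
fixed = "h3. My Title Goes Here".delete_prefix("h3. ")
```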

Ruby gsub / regex modifiers?

Where can I find the documentation on the modifiers for gsub? \a \b \c \1 \2 \3 %a %b %c $1 $2 %3 etc.?
Specifically, I'm looking at this code... something.gsub(/%u/, unit) what's the %u?
First off, %u is nothing special in ruby regex:
mixonic@pandora ~ $ irb
irb(main):001:0> '%u'.gsub(/%u/,'heyhey')
=> "heyhey"
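So in code like something.gsub(/%u/, unit), the %u is most likely just a placeholder chosen by the author. A guess at what such code does, with an invented template and unit:

```ruby
# Hypothetical template: "%u" is only a marker chosen by whoever wrote it.
template = "Speed: 90 %u (limit: 80 %u)"
unit = "km/h"

filled = template.gsub(/%u/, unit)
# filled => "Speed: 90 km/h (limit: 80 km/h)"
```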
The definitive documentation for Ruby 1.8 regex is in the Ruby Doc Bundle:
http://ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/syntax.html#regexp
Strings delimited by slashes are regular expressions. The characters right after the latter slash denote the options to the regular expression. Option i means that the regular expression is case insensitive. Option o means that the regular expression does expression substitution only once, the first time it is evaluated. Option x means extended regular expression, which means whitespace and comments are allowed in the expression. Option p denotes POSIX mode, in which newlines are treated as normal characters (matched by dots).
The %r/STRING/ is another form of the regular expression.
^ - beginning of a line or string
$ - end of a line or string
. - any character except newline
\w - word character [0-9A-Za-z_]
\W - non-word character
\s - whitespace character [ \t\n\r\f]
\S - non-whitespace character
\d - digit, same as [0-9]
\D - non-digit
\A - beginning of a string
\Z - end of a string, or before newline at the end
\z - end of a string
\b - word boundary (outside [] only)
\B - non-word boundary
\b - backspace (0x08) (inside [] only)
[ ] - any single character of set
* - 0 or more previous regular expression
*? - 0 or more previous regular expression (non greedy)
+ - 1 or more previous regular expression
+? - 1 or more previous regular expression (non greedy)
{m,n} - at least m but most n previous regular expression
{m,n}? - at least m but most n previous regular expression (non greedy)
? - 0 or 1 previous regular expression
| - alternation
( ) - grouping regular expressions
(?# ) - comment
(?: ) - grouping without backreferences
(?= ) - zero-width positive look-ahead assertion
(?! ) - zero-width negative look-ahead assertion
(?ix-ix) - turns on (or off) `i' and `x' options within regular expression. These modifiers are localized inside an enclosing group (if any).
(?ix-ix: ) - turns on (or off) `i' and `x' options within this non-capturing group.
Backslash notation and expression substitution are available in regular expressions.
Good luck!
Zenspider's Quickref contains a section explaining which escape sequences can be used in regexen and one listing the pseudo variables that get set by a regexp match. In the second argument to gsub you simply write the name of the variable with a backslash instead of a $ and it will be replaced with the value of that variable after applying the regexp. If you use a double quoted string, you need to use two backslashes.
When using the block-form of gsub you can simply use the variables directly. If you return a string containing e.g. \1 from the block, that will not be replaced with $1. That only happens when using the two-argument form.
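A short sketch of those rules, using an invented name string:

```ruby
name = "John Smith"
pattern = /(\w+) (\w+)/

# Two-argument form: \1 and \2 in the replacement refer to the groups.
single_quoted = name.gsub(pattern, '\2, \1')    # => "Smith, John"
double_quoted = name.gsub(pattern, "\\2, \\1")  # same, backslashes doubled

# Block form: use $1 and $2 directly.
from_block = name.gsub(pattern) { "#{$2}, #{$1}" }  # => "Smith, John"

# A \1 returned from the block is NOT expanded; it stays as literal text.
not_expanded = name.gsub(pattern) { '\2, \1' }
```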
If you use a block with sub/gsub, you can access the groups like this:
>> rx = /(ab(cd)ef)/
>> s = "-abcdef-abcdef"
>> s.gsub(rx) { $2 }
=> "-cd-cd"
For Ruby 1.9's Oniguruma, good documentation of the regular expression syntax is available here.
gsub is also a string substitution function in the Lua language.
In Lua's pattern language, %u represents the upper case character class, i.e. it will match any upper case letter. Similarly, %l will match any lower case letter.
Lua Pattern Character Classes

Resources