How do I match valid words with a ruby regular expression - ruby

Using a Ruby regular expression, how do I match all words in a comma-separated list, but only those made up entirely of valid word characters (i.e., letters, digits, or underscores)? For instance, given the string:
"see, jane, run, r#un, j#ne, r!n"
I would like to match the words
'see', 'jane' and 'run',
but not the words
'r#un', 'j#ne' or 'r!n'.
I do not want to match the commas, just the words themselves.
I have started the regex here: http://rubular.com/regexes/12126

s="see, jane, run, r#un, j#ne, r!n, fast"
s.scan(/(?:\A|,\s*)(\w+)(?=,|\Z)/).flatten
# => ["see", "jane", "run", "fast"]

Another way:
result = s.split(/[\s,]/).select{|_w| _w =~ /^\w+$/}
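The split-and-filter idea can also be sketched end to end as below. This is a minimal sketch, assuming tokens are separated by commas with optional surrounding whitespace; the method name valid_words is hypothetical and not part of the original answers. It anchors with \A and \z rather than ^ and $, since ^ and $ match at line boundaries and a token containing a newline could otherwise slip through.

```ruby
# Hypothetical helper: keep only the comma-separated tokens that consist
# entirely of word characters (letters, digits, underscore).
def valid_words(list)
  list.split(",")                          # split on the commas only
      .map(&:strip)                        # drop whitespace around each token
      .select { |w| w.match?(/\A\w+\z/) }  # the whole token must be \w characters
end

p valid_words("see, jane, run, r#un, j#ne, r!n, fast")
#=> ["see", "jane", "run", "fast"]
```

Because the split is on commas only, a token with an inner space such as "a b" is rejected as a whole instead of being broken into two valid words.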

Related

How to extract substring between two characters/substrings

I have a string:
string1 = "my name is fname.lname and i live in xyz. my lname is not common"
I want to extract a substring from string1 that is anything between the first empty space " " and ".lname". In the case above, the answer should be "fname.lname".
string1[/(?<= ).*?(?=\.lname\b)/]
#=> "name is fname"
(?<= ) is a positive lookbehind that requires the first character matched be immediately preceded by a space, but that space is not part of the match.
(?=\.lname\b) is a positive lookahead that requires the last character matched be immediately followed by the string ".lname"¹, which is itself followed by a word break (\b), but that string is not part of the match. That ensures, for example, that ".lnamespace" is not matched. If that should be matched, remove \b.
.*? matches zero or more characters (.*), non-greedily (?). (Matches are by default greedy.) The non-greedy qualifier has the following effect:
"my name is fname.lname and fname.lname"[/(?<= ).*(?=\.lname\b)/]
#=> "name is fname.lname and fname"
"my name is fname.lname and fname.lname"[/(?<= ).*?(?=\.lname\b)/]
#=> "name is fname"
In other words, the non-greedy (greedy) match matches the first (last) occurrence of ".lname" in the string.
This could alternatively be written with a capture group and no lookarounds:
string1[/ (.*?)\.lname\b/, 1]
#=> "name is fname"
This regular expression reads, "match a space followed by zero or more characters (lazily), saved in capture group 1, followed by the string ".lname" followed by a word break". This uses the form of String#[] that takes two arguments: a regular expression and a reference to a capture group.
Yet another way follows.
string1[(string1 =~ / /)+1..(string1 =~ /\.lname\b/)-1]
#=> "name is fname"
¹ The period in ".lname" must be escaped because an unescaped period in a regular expression (except in a character class) matches any character.

regular expression in Ruby with parentheses and match

In Ruby,
x = "this is a test".match(/(\w+) (\w+)/)
puts x[0], x[1], x[2]
why is the output
this is
this
is
Nothing special is going on here. You have the pattern
(\w+) (\w+)
namely two words separated by a space. That would be "this is" in your example (since we start looking for matches from the beginning of the string). The full match goes into the zeroth element of the return value, in your case x[0].
Now parentheses capture matches. The first left parenthesis starts at the first word, namely "this" so that value goes into x[1]. The second left parenthesis starts a group that matches the word "is", which will be captured into x[2].
Again, nothing special. This is how regular expression matching and grouping work in many, many languages.
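To make the numbering concrete, here is a small sketch that pulls the same pieces out of the MatchData object by other routes. The variable names are illustrative only; the named-group variant is an alternative not used in the question.

```ruby
m = "this is a test".match(/(\w+) (\w+)/)

whole  = m[0]          # the full match: "this is"
groups = m.captures    # only the parenthesized groups: ["this", "is"]
rest   = m.post_match  # whatever follows the match: " a test"

# Named groups are an alternative to numeric indexes.
named = "this is a test".match(/(?<first>\w+) (?<second>\w+)/)
first_word = named[:first]  # "this"

p whole, groups, rest, first_word
```

Group numbers follow the order of the opening parentheses, so m[1] and m.captures[0] refer to the same text.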

Splitting the content of brackets without separating the brackets ruby

I am currently working on a ruby program to calculate terms. It works perfectly fine except for one thing: brackets. I need to filter the content or at least, to put the content into an array, but I have tried for an hour to come up with a solution. Here is my code:
splitted = term.split(/\(+|\)+/)
I need an array instead of the brackets, for example:
"1-(2+3)" #=>["1", "-", ["2", "+", "3"]]
I already tried this:
/(\((?<=.*)\))/
but it returned:
Invalid pattern in look-behind.
Can someone help me with this?
UPDATE
I forgot to mention that my program will split the term itself; I only need the content of the brackets to be an array.
If you need to keep track of the hierarchy of parentheses with arrays, you won't manage it just with regular expressions. You'll need to parse the string word by word, and keep a stack of expressions.
Pseudocode:
Expressions = new stack
Add new array on stack
while word in string:
    if word is "(": Add new array on stack
    Else if word is ")": Remove the last array from the stack and add it to the (next) last array of the stack
    Else: Add the word to the last array of the stack
When exiting the loop, there should be only one array in the stack (if not, you have inconsistent opening/closing parentheses).
Note: If your ultimate goal is to evaluate the expression, you could save time and parse the string in Postfix aka Reverse-Polish Notation.
Also consider using off-the-shelf libraries.
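The stack-based pseudocode above could be sketched in Ruby roughly as follows. This is a minimal sketch under stated assumptions: the method name nest_tokens is hypothetical, and the tokenizer (runs of digits, or any single non-space character) is a guess at what "word" means for this kind of term, not something the question specifies.

```ruby
# Hypothetical helper: turn "1-(2+3)" into ["1", "-", ["2", "+", "3"]].
def nest_tokens(term)
  # Assumed tokenizer: a run of digits, or any single non-space character.
  tokens = term.scan(/\d+|\S/)
  stack = [[]]                  # the bottom array collects top-level tokens

  tokens.each do |token|
    case token
    when "("
      stack.push([])            # open a new nested array
    when ")"
      raise ArgumentError, "unmatched )" if stack.size == 1
      finished = stack.pop
      stack.last << finished    # attach it to the enclosing array
    else
      stack.last << token
    end
  end

  raise ArgumentError, "unmatched (" unless stack.size == 1
  stack.first
end

p nest_tokens("1-(2+3)")
#=> ["1", "-", ["2", "+", "3"]]
p nest_tokens("2*(3+(4-1))")
#=> ["2", "*", ["3", "+", ["4", "-", "1"]]]
```

The two ArgumentError checks correspond to the consistency remark above: after the loop exactly one array should remain on the stack, and a closing parenthesis should never be seen while only the bottom array is present.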
A solution depends on the pattern you expect between the parentheses, which you have not specified. (For example, for "(st12uv)" you might want ["st", "12", "uv"], ["st12", "uv"], ["st1", "2uv"] and so on). If, as in your example, it is a natural number followed by a +, followed by another natural number, you could do this:
str = "1-( 2+ 3)"
r = /
\(\s* # match a left parenthesis followed by >= 0 whitespace chars
(\d+) # match one or more digits in a capture group
\s* # match >= 0 whitespace chars
(\+) # match a plus sign in a capture group
\s* # match >= 0 whitespace chars
(\d+) # match one or more digits in a capture group
\s* # match >= 0 whitespace chars
\) # match a right parenthesis
/x
str.scan(r).first
#=> ["2", "+", "3"]
Suppose instead + could be +, -, * or /. Then you could change:
(\+)
to:
([-+*\/])
Note that, in a character class, + needn't be escaped and - needn't be escaped if it is the first or last character of the class (as in those cases it would not signify a range).
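As a quick check of the character-class variant, the sketch below runs the same pattern (written on one line, without the /x comments) against each of the four operators. The constant name OPERATION is made up for this illustration.

```ruby
# Same pattern as above, with (\+) replaced by the character class.
# The hyphen comes first in the class, so it is a literal, not a range.
OPERATION = /\(\s*(\d+)\s*([-+*\/])\s*(\d+)\s*\)/

%w[+ - * /].each do |op|
  p "1-( 2#{op} 3)".scan(OPERATION).first
end
# prints ["2", "+", "3"], ["2", "-", "3"], ["2", "*", "3"], ["2", "/", "3"]
```

The leading "1-" in the test string is ignored because the pattern only starts matching at the left parenthesis.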
Incidentally, you received the error message, "Invalid pattern in look-behind" because Ruby's lookbehinds cannot contain variable-length matches (such as .*); lookaheads have no such restriction. With positive lookbehinds you can get around that by using \K instead. For example,
r = /
\d+ # match one or more digits
\K # forget everything previously matched
[a-z]+ # match one or more lowercase letters
/x
"123abc"[r] #=> "abc"

Regex matching chars around text

I have a string with chars inside and I would like to match only the chars around a string.
"This is a [1]test[/1] string. And [2]test[/2]"
Rubular http://rubular.com/r/f2Xwe3zPzo
Currently, the code in the link matches the text inside the special chars, how can I change it?
Update
To clarify my question. It should only match if the opening and closing has the same number.
"[2]first[/2] [1]second[/2]"
In the code above, only first should match and not second. The text inside the special chars (first), should be ignored.
Try this:
(\[[0-9]\]).+?(\[\/[0-9]\])
Permalink to the example on Rubular.
Update
Since you want to remove the 'special' characters, try this instead:
foo = "This is a [1]test[/1] string. And [2]test[/2]"
foo.gsub(/\[\/?\d\]/, "")
# => "This is a test string. And test"
Update, Part II
You only want to remove the 'special' characters when the surrounding tags match, so what about this:
foo = "This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]"
foo.gsub(/(?:\[(?<number>\d)\])(?<content>.+?)(?:\[\/\k<number>\])/, '\k<content>')
# => "This is a test string. And test, but not [3]test[/2]"
\[([0-9])\].+?\[\/\1\]
([0-9]) is a capture since it is surrounded with parentheses. The \1 tells it to use the result of that capture. If you had more than one capture, you could reference them as well, \2, \3, etc.
You can also use a named capture, rather than \1 to make it a little less cryptic. As in: \[(?<number>[0-9])\].+?\[\/\k<number>\]
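Applied to the string from the question's update, both spellings behave the same way. This is only a sketch; the constant names NUMBERED and NAMED are made up for the illustration.

```ruby
NUMBERED = /\[([0-9])\].+?\[\/\1\]/
NAMED    = /\[(?<number>[0-9])\].+?\[\/\k<number>\]/

str = "[2]first[/2] [1]second[/2]"

p str[NUMBERED]               #=> "[2]first[/2]"
p str[NAMED]                  #=> "[2]first[/2]"
p "[1]second[/2]"[NUMBERED]   #=> nil, the closing number differs
```

The backreference only succeeds when the closing tag repeats the exact digit captured by the opening tag, which is why the second pair is skipped.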
Here's a way to do it that uses the form of String#gsub that takes a block. The idea is to pull strings such as "[1]test[/1]" into the block, and there remove the unwanted bits.
str = "This is a [1]test[/1] string. And [2]test[/2], plus [3]test[/99]"
r = /
\[ # match a left bracket
(\d+) # capture one or more digits in capture group 1
\] # match a right bracket
.+? # match one or more characters lazily
\[\/ # match a left bracket and forward slash
\1 # match the contents of capture group 1
\] # match a right bracket
/x
str.gsub(r) { |s| s[/(?<=\]).*?(?=\[)/] }
#=> "This is a test string. And test, plus [3]test[/99]"
Aside: When I first heard of named capture groups, they seemed like a great idea, but now I wonder if they really make regexes easier to read than \1, \2....

Ruby scan regex will not match optional

Take this string.
a = "real-ab(+)real-bc(+)real-cd-xy"
a.scan(/[a-z_0-9]+\-[a-z_0-9]+[\-\[a-z_0-9]+\]?/)
=> ["real-ab", "real-bc", "real-cd-xy"]
But how come this next string gets nothing?
a = "real-a(+)real-b(+)real-c"
a.scan(/[a-z_0-9]+\-[a-z_0-9]+[\-\[a-z_0-9]+\]?/)
=> []
How can I have it so both strings output into a 3 count array?
You've confused parentheses (used for grouping) and square brackets (used for character classes). You want
a.scan(/[a-z_0-9]+-[a-z_0-9]+(?:-[a-z_0-9]+)?/)
(?:...) creates a non-capturing group which is what you need here.
Furthermore, unless you want to disallow uppercase letters explicitly, you can write \w as a shorthand for "a letter, digit or underscore":
a.scan(/\w+-\w+(?:-\w+)?/)
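Running the shortened pattern on both strings from the question gives three elements each. The sketch also shows, as an aside, what String#scan returns if the optional part is written as a capturing group instead; the constant names are made up for this illustration.

```ruby
PATTERN   = /\w+-\w+(?:-\w+)?/   # optional third part, non-capturing
CAPTURING = /\w+-\w+(-\w+)?/     # same, but with a capturing group

first  = "real-ab(+)real-bc(+)real-cd-xy"
second = "real-a(+)real-b(+)real-c"

p first.scan(PATTERN)    #=> ["real-ab", "real-bc", "real-cd-xy"]
p second.scan(PATTERN)   #=> ["real-a", "real-b", "real-c"]

# With a capturing group, scan returns the groups rather than the matches.
p first.scan(CAPTURING)  #=> [[nil], [nil], ["-xy"]]
```

That difference in scan's return value is the practical reason for preferring (?:...) here.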
a.scan(/[a-z_0-9]+\-[a-z_0-9]+/)
Why not simply?
a.scan(/[a-z_0-9\-]+/)
