Regular Expression to select text between curly braces, Ruby - ruby

I'm working on a way to filter and replace broken curly brace tags such as {{hello}. I've tried a few regular expressions from here on Stack Overflow and some of my own. The closest I've come is using this regex
(?=(\}(?!\})))((?<!\})\}) which selects the last tag in the example code block below. However it does not select the entire tag, it just selects the ending curly brace }.
{{hello}}
{{world}}}
{{foobar}}
{{hello}
What I need to do is select any tag that is missing the second ending curly brace like {{hello}. Can anyone help me with the regex to select this type of tag?

filter and replace broken curly brace tags
This problem is really easy to solve if you're not nesting things.
Try this:
[\{]+([^}]+)[}]+
Essentially, you can just replace the match with {{\1}} (Ruby uses \1, not $1, for backreferences in a replacement string).
It will work as long as there are one or more of { and } consecutively around the match.
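A minimal sketch of that replacement in Ruby, assuming tags never nest and the tag text never contains a closing brace. The helper name normalize_tags is made up for illustration; the pattern is the one given above.

```ruby
# normalize_tags is a hypothetical helper name, not part of any library.
# It rewrites any run of opening braces, a tag body, and any run of closing
# braces as a well-formed {{tag}}.
def normalize_tags(text)
  text.gsub(/[\{]+([^}]+)[}]+/, '{{\1}}')
end

puts normalize_tags("{{hello}}\n{{world}}}\n{{foobar}}\n{{hello}")
```

Note that this also trims tags with too many braces, such as {{world}}}, down to {{world}}, which may or may not be what you want.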

I assume we are given a string containing substrings beginning with "{{", followed by a "tag", which is a string of characters other than "{" and "}", followed by either "}" or "}}". We wish to return the tags that are followed by only one right brace. For example:
str = "Sue said {{hello}}, Bob said {{world}\nTom said {{foobar}}, Lola said {{hello}"
We can use the following regex:
r = /
\{\{ # match {{
([^}]+) # match one or more characters other than } in capture group 1
\} # match }
(?:\z|[^}]) # match end of string or a character other than }
# in a non-capture group
/x # free-spacing regex definition mode
str.scan(r).flatten
#=> ["world", "hello"]
The regex could of course be written in the conventional way:
r = /\{\{([^}]+)\}(?:\z|[^}])/
Note
str.scan(r)
#=> [["world"], ["hello"]]
hence the need for flatten.
See String#scan for an explanation.
Obviously, the same regex works if
str = "{{hello}}\n{{world}\n{{foobar}}\n{{hello}"
str.scan(r).flatten
#=> ["world", "hello"]
If
words = %w| {{hello}} {{world} {{foobar}} {{hello} |
#=> ["{{hello}}", "{{world}", "{{foobar}}", "{{hello}"]
then
words.select { |w| w =~ r }.map { |w| w[/[^{}]+/] }
#=> ["world", "hello"]

I suggest using the following expression:
/(?<!{){{[^{}]+}(?!})/
See the regex101 demo
The pattern will match any string of text that starts with {{ not preceded with {, followed with any 1+ characters other than { and } and then a } that is not followed with }. Thus, this pattern matches strings of exactly {{xxx} structure.
Here is a Ruby demo:
"{{hello}".gsub(/(?<!{){{[^{}]+}(?!})/, "\\0}")
# => {{hello}}
Pattern details:
(?<!{) - a negative lookbehind failing the match if a { appears immediately to the left of the current position
{{ - literal {{
[^{}]+ - 1+ characters other than { and } (to allow empty values, use * instead of +)
} - a closing single }
(?!}) - a negative lookahead failing the match if a } appears right after the previously matched }.
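As a rough sketch, the same pattern can be applied to a multi-line string like the one in the question. The constant and helper names here (BROKEN_TAG, repair_tags) are assumptions for illustration, and the braces are escaped only for readability.

```ruby
# Pattern from the answer above, with the braces escaped for readability.
BROKEN_TAG = /(?<!\{)\{\{[^{}]+\}(?!\})/

# repair_tags is a hypothetical helper: it appends the missing brace to each
# tag of the exact form {{xxx} and leaves everything else untouched.
def repair_tags(text)
  text.gsub(BROKEN_TAG) { |tag| tag + '}' }
end

puts repair_tags("{{hello}}\n{{world}}}\n{{foobar}}\n{{hello}")
```

Only the last tag is changed here; {{world}}} is left alone because its first closing brace is followed by another one.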

Related

how to get formatted data from a array and convert it back to new array in ruby

I have a data array like the one below. I need to format it as shown.
a = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
Need to format like
["EC006", "ED009", "AX009"]
arr = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
To merely extract the strings of interest, assuming the data is formatted correctly, we may write the following.
arr.map { |s| s[/(?<=\[)[^\]]*/] }
#=> ["EC006", "ED009", "AX009"]
See String#[] and Demo
In the regular expression (?<=\[) is a positive lookbehind that asserts the previous character is '['. The ^ at the beginning of the character class [^\]] means that any character other than ']' must be matched. Appending the asterisk ([^\]]*) causes the character class to be matched zero or more times.
Alternatively, we could use the regular expression
/\[\K[^\]]*/
where \K causes the beginning of the match to be reset to the current string location and all previously-matched characters to be discarded from the match that is returned.
To confirm the correctness of the formatting as well, use
arr.map { |s| s[/\A[1-9]\d{3} \[\K[A-Z]{2}\d{3}(?=]\z)/] }
#=> ["EC006", "ED009", "AX009"]
Demo
Note that at the link I replaced \A and \z with ^ and $, respectively, in order to test the regex against multiple strings.
This regular expression can be broken down as follows.
\A # match beginning of string
[1-9] # match a digit other than zero
\d{3} # match 3 digits
[ ] # match one space
\[ # match '['
\K # reset start of match to current string location and discard
# all characters previously matched from match that is returned
[A-Z]{2} # match 2 uppercase letters
\d{3} # match 3 digits
(?=]\z) # positive lookahead asserts following character is
# ']' and that character is at the end of the string
In the above I placed a space character in a character class ([ ]) merely to make it visible to the reader.
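A small sketch of how the validating pattern behaves on badly formatted entries, assuming you simply want to skip them. CODE_PATTERN and extract_codes are illustrative names, not anything from the question.

```ruby
# Anchored pattern from the answer above: four digits (no leading zero),
# a space, then a bracketed code of two uppercase letters and three digits.
CODE_PATTERN = /\A[1-9]\d{3} \[\K[A-Z]{2}\d{3}(?=\]\z)/

# extract_codes is a hypothetical helper. Entries that do not match the
# expected layout yield nil, which compact then drops.
def extract_codes(entries)
  entries.map { |s| s[CODE_PATTERN] }.compact
end

p extract_codes(["8619 [EC006]", "0123 [ED009]", "1034 [ax009]", "1034 [AX009]"])
```

Here the second entry fails because of the leading zero and the third because of the lowercase letters, so only "EC006" and "AX009" come back.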
Input
a = ["8619 [EC006]", "9876 [ED009]", "1034 [AX009]"]
Code
p a.collect { |x| x[/\[(.*)\]/, 1] }
Output
["EC006", "ED009", "AX009"]

Positive Lookahead and Non-capturing group difference

When you want to match either of two patterns but not capture it, you would use a noncapturing group ?::
%r{(?:https?|ftp)://(.+)}
But what if I want to capture '_1' in the string 'john_1'? It could be '_2' or '_' followed by anything else. First I tried a non-capturing group:
'john_1'.gsub(/(?:.+)(_.+)/, "")
=> ""
It does not work. I am telling it to not capture one or more characters but to capture _ and all characters after it.
Instead the following works:
'john_1'.gsub(/(?=.+)(_.+)/, "")
=> "john"
I used a positive lookahead. The definition I found for positive lookahead was as follows:
q(?=u) matches a q that is
followed by a u, without making the u part of the match. The positive
lookahead construct is a pair of parentheses, with the opening
parenthesis followed by a question mark and an equals sign.
But that definition doesn't really fit my example. What makes the Positive Lookahead work but not the Non-capturing group work in the example I provide?
Capturing and matching are two different things. (?:expr) doesn't capture expr, but it's still included in the matched string. Zero-width assertions, e.g. (?=expr), don't capture or include expr in the matched string.
Perhaps some examples will help illustrate the difference:
> "abcdef"[/abc(def)/] # => abcdef
> $1 # => def
> "abcdef"[/abc(?:def)/] # => abcdef
> $1 # => nil
> "abcdef"[/abc(?=def)/] # => abc
> $1 # => nil
When you use a non-capturing group in your String#gsub call, it's still part of the match, and gets replaced by the replacement string.
Your first example doesn't work because a non-capturing group is still part of the overall match, whereas a lookahead only asserts what follows and is not part of the overall match.
This is easier to understand if you get the actual match data:
# Non-capturing group
/(?:.+)(_.+)/.match 'john_1'
=> #<MatchData "john_1" 1:"_1">
# Positive lookahead
/(?=.+)(_.+)/.match 'john_1'
=> #<MatchData "_1" 1:"_1">
EDIT: I should also mention that sub and gsub work on the entire match, not individual capture groups (although those can be used in the replacement).
'john_1'.gsub(/(?:.+)(_.+)/, 'phil\1')
=> "phil_1"
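To make the difference concrete, here is a short sketch with three hypothetical helpers (the names are mine) that each try to strip the "_1" suffix. It only illustrates which part of the string sub ends up replacing.

```ruby
# Hypothetical helpers contrasting what sub replaces in each case.

# The non-capturing group still consumes "john", so the whole string is
# part of the match and is replaced.
def strip_with_group(name)
  name.sub(/(?:.+)(_.+)/, '')
end

# The lookahead consumes nothing, so only "_1" is matched and replaced.
def strip_with_lookahead(name)
  name.sub(/(?=.+)(_.+)/, '')
end

# Capturing the prefix and putting it back gives the same result.
def strip_with_backreference(name)
  name.sub(/(.+)(_.+)/, '\1')
end

p strip_with_group('john_1')         # ""
p strip_with_lookahead('john_1')     # "john"
p strip_with_backreference('john_1') # "john"
```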
Let's consider a couple of situations.
The string preceding the underscore must be "john" and the underscore is followed by one or more characters
str = "john_1"
You have two choices.
Use a positive lookbehind
str[/(?<=john)_.+/]
#=> "_1"
The positive lookbehind requires that "john" must appear immediately before the underscore, but it is not part of the match that is returned.
Use a capture group:
str[/john(_.+)/, 1]
#=> "_1"
This regular expression matches "john_1", but "_.+" is captured in capture group 1. By examining the doc for the method String#[] you will see that one form of the method is str[regexp, capture], which returns the contents of the capture group capture. Here capture equals 1, meaning the first capture group.
Note that the string following the underscore may contain underscores: "john_1_a"[/(?<=john)_.+/] #=> "_1_a".
If the underscore can be at the end of the string replace + with * in the above regular expressions (meaning match zero or more characters after the underscore).
The string preceding the underscore can be anything and the underscore is followed by one or more characters
str = "john_mary_tom_julie"
We may consider two cases.
The string returned is to begin with the first underscore
In this case we could write:
str[/_.+/]
#=> "_mary_tom_julie"
This works because the regex engine returns the leftmost match, which begins at the first underscore encountered, and .+ is greedy by default, so it extends to the end of the string.
The string returned is to begin with the last underscore
Here we could write:
str[/_[^_]+\z/]
#=> "_julie"
This regex matches an underscore followed by one or more characters that are not underscores, followed by the end-of-string anchor (\z).
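The two cases can be sketched side by side, with a lazy variant added to show what greediness contributes. The helper names are assumptions for illustration only.

```ruby
# Hypothetical helpers; each returns nil when the string has no underscore
# followed by at least one character.

# Leftmost match, greedy: starts at the first underscore, runs to the end.
def from_first_underscore(str)
  str[/_.+/]
end

# Same starting point, but the lazy quantifier stops after one character.
def shortest_from_first_underscore(str)
  str[/_.+?/]
end

# Anchored at the end of the string, so only the last segment can match.
def from_last_underscore(str)
  str[/_[^_]+\z/]
end

str = "john_mary_tom_julie"
p from_first_underscore(str)          # "_mary_tom_julie"
p shortest_from_first_underscore(str) # "_m"
p from_last_underscore(str)           # "_julie"
```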
Aside: the method String#[]
[] may seem an odd name for a method but it is a method nevertheless, so it can be invoked in the conventional way:
str.[](/john(_.+)/, 1)
#=> "_1"
The expression str[/john(_.+)/, 1] is an example (of which there are many in Ruby) of syntactic sugar. When written str[...] Ruby converts it to the conventional expression for methods before evaluating it.

Regex to grab full firstname and first letter of last name

I have a list of users grabbed by the Etc Ruby library:
Thomas_J_Perkins
Jennifer_Scanner
Amanda_K_Loso
Aaron_Cole
Mark_L_Lamb
What I need to do is grab the full first name, skip the middle name (if given), and grab the first character of the last name. The output should look like this:
Thomas P
Jennifer S
Amanda L
Aaron C
Mark L
I'm not sure how to do this. I've tried grabbing all of the characters with /\w+/, but that grabs everything.
You don't always need regular expressions.
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. Jamie Zawinski
You can do it with some simple Ruby code
string = "Mark_L_Lamb"
string.split('_').first + ' ' + string.split('_').last[0]
=> "Mark L"
I think it's simpler without regex:
array = "Thomas_J_Perkins".split("_") # split at _
array.first + " " + array.last[0] # .first prints first name .last[0] prints first char of last name
#=> "Thomas P"
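Applied to the whole list from the question, the split approach might look like the sketch below. short_name is a made-up helper name, and it assumes every username contains at least one underscore (a name without one would repeat its own first letter).

```ruby
# short_name is a hypothetical helper: first chunk, a space, then the first
# character of the last chunk. Middle names are ignored.
def short_name(username)
  parts = username.split('_')
  "#{parts.first} #{parts.last[0]}"
end

users = %w[Thomas_J_Perkins Jennifer_Scanner Amanda_K_Loso Aaron_Cole Mark_L_Lamb]
puts users.map { |u| short_name(u) }
```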
You can use
^([^\W_]+)(?:_[^\W_]+)*_([^\W_])[^\W_]*$
And replace with \1 \2 (a space between the groups, to match the requested output). See the regex demo
The [^\W_] matches a letter or a digit. If you want to only match letters, replace [^\W_] with \p{L}.
^(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*$
See updated demo
The point is to match and capture the first chunk of letters up to the first _ (with (\p{L}+)), then match 0+ sequences of _ + letters inside (with (?:_\p{L}+)*_) and then match and capture the last word first letter (with (\p{L})) and then match the rest of the string (with \p{L}*).
NOTE: replace ^ with \A and $ with \z if you have independent strings (as in Ruby ^ matches the start of a line and $ matches the end of the line).
Ruby code:
s.sub(/^(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*$/, "\\1 \\2")
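A sketch of the letters-only pattern applied to independent strings, using \A and \z as the note suggests and a space in the replacement so the result matches the output asked for in the question. NAME_PATTERN and short_name_regex are illustrative names.

```ruby
# Letters-only pattern from above, anchored to the whole string.
NAME_PATTERN = /\A(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*\z/

# short_name_regex is a hypothetical helper. Strings that do not match the
# pattern are returned unchanged, because sub finds nothing to replace.
def short_name_regex(username)
  username.sub(NAME_PATTERN, '\1 \2')
end

users = %w[Thomas_J_Perkins Jennifer_Scanner Amanda_K_Loso Aaron_Cole Mark_L_Lamb]
puts users.map { |u| short_name_regex(u) }
```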
I'm in the don't-use-a-regex-for-this camp.
str1 = "Alexander_Graham_Bell"
str2 = "Sylvester_Grisby"
"#{str1[0...str1.index('_')]} #{str1[str1.rindex('_')+1]}"
#=> "Alexander B"
"#{str2[0...str2.index('_')]} #{str2[str2.rindex('_')+1]}"
#=> "Sylvester G"
or
first, last = str1.split(/_.+_|_/)
#=> ["Alexander", "Bell"]
first+' '+last[0]
#=> "Alexander B"
first, last = str2.split(/_.+_|_/)
#=> ["Sylvester", "Grisby"]
first+' '+last[0]
#=> "Sylvester G"
but if you insist...
r = /
(.+?) # match any characters non-greedily in capture group 1
(?=_) # match an underscore in a positive lookahead
(?:.*) # match any characters greedily in a non-capture group
(?:_) # match an underscore in a non-capture group
(.) # match any character in capture group 2
/x # free-spacing regex definition mode
str1 =~ r
$1+' '+$2
#=> "Alexander B"
str2 =~ r
$1+' '+$2
#=> "Sylvester G"
You can of course write
r = /(.+?)(?=_)(?:.*)(?:_)(.)/
This is my attempt:
/([a-zA-Z]+)_([a-zA-Z]+_)?([a-zA-Z])/
See demo
Let's see if this works:
/^([^_]+)(?:_\w)?_(\w)/
And then you'll have to combine the first and second matches into the format you want. I don't know Ruby, so I can't help you there.
And another attempt using a replacement method:
result = subject.gsub(/^([^_]+)(?:_[^_])?_([^_])[^_]+$/, '\1 \2')
We match the entire string, with the relevant parts in capturing groups, then return just the two captured groups.
Using the split method is much better:
full_names.map do |full_name|
  parts = full_name.split('_').values_at(0,-1)
  parts.last.slice!(1..-1)
  parts.join(' ')
end
/^[A-Za-z]{5,15}\s[A-Za-z]$/i
This will have the following criteria:
5-15 characters for first name then a whitespace and finally a single character for last name.

Regular expression - get multiple matches into each group

I have a string like this:
raw_string = "(a=1)(b=2)(c=3)"
I'd like to match this and get values within each set of parentheses, and get each result in a group.
For example:
group 0 = "a=1"
group 1 = "b=2" and so on..
I've tried /(\(.*\))/g but it doesn't seem to work. Can someone help me with this?
thanks!
str = "(a=1)(b=2) (c=3)"
As suggested in a comment by @stribizhev:
r = /
\( # Match a left paren
([^\)]+) # Match >= 1 characters other than a right paren in capture group 1
\) # Match a right paren
/x # extended/free-spacing regex definition mode
str.scan(r).flatten
#=> ["a=1", "b=2", "c=3"]
Note ([^\)]+) could be replaced by (.+?), making it a lazy match on any characters, as I've done in this alternative regex, which uses lookarounds rather than a capture group:
r = /
(?<=\() # Match a left paren in a positive lookbehind
.+? # Match >= 1 characters lazily
(?=\)) # Match a right paren in a positive lookahead
/x
Here the lookbehind could be replaced by \(\K, which reads, "match a left paren then forget about everything matched so far".
Lastly, you could use String#split on the right then left paren, possibly separated by spaces, then delete the first left and last right parens:
str.split(/\)\s*\(/).map { |s| s.delete '()' }
#=> ["a=1", "b=2", "c=3"]
Wouldn't it be nice if we could write s.strip(/[()]/)?
If you mean the pattern with parentheses appears exactly three times (or a different fixed number of times), then it is possible, but if you intend that the pattern appears an arbitrary number of times, then you can't. A regex can only have a fixed number of captures or named captures.
Just to show that you can get them into an arbitrary number of capture groups:
"(a=1)(b=2)(c=3)"[/#{'(?:\((.*?)\))?' * 99}/]
[$1, $2, $3]
#=> ["a=1", "b=2", "c=3"]
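Since scan returns one result per match, it sidesteps the fixed-number-of-groups limitation. A minimal sketch that prints the results in the "group N = ..." form from the question, with paren_groups as a hypothetical helper name:

```ruby
# paren_groups is a hypothetical helper: it returns the text inside each
# pair of parentheses, however many pairs there are.
def paren_groups(raw)
  raw.scan(/\(([^)]+)\)/).flatten
end

paren_groups("(a=1)(b=2)(c=3)").each_with_index do |value, index|
  puts "group #{index} = #{value}"
end
```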

Regex for matching all words between a set of curly braces

A simple question for most regex experts I know, but I'm trying to return all matches for the words between some curly braces in a sentence; however Ruby is only returning a single match, and I cannot figure out why exactly.
I'm using this example sentence:
sentence = "hello {name} is {thing}"
with this regex to try and return both "{name}" and "{thing}":
sentence[/\{(.*?)\}/]
However, Ruby is only returning "{name}". Can anyone explain why it doesn't match for both words?
You're close, but using the wrong method:
sentence = "hello {name} is {thing}"
sentence.scan(/\{(.*?)\}/)
# => [["name"], ["thing"]]
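A brief sketch of the difference, using made-up helper names: String#[] stops at the first match, while scan keeps going. Without a capture group, scan returns the full matches, braces included, which is what the question asks for.

```ruby
# Hypothetical helpers for illustration.

# String#[] returns only the first match (or nil).
def first_placeholder(sentence)
  sentence[/\{(.*?)\}/]
end

# scan without a capture group returns every full match, braces included.
def all_placeholders(sentence)
  sentence.scan(/\{.*?\}/)
end

# scan with a capture group returns one array per match; flatten unwraps them.
def placeholder_names(sentence)
  sentence.scan(/\{(.*?)\}/).flatten
end

sentence = "hello {name} is {thing}"
p first_placeholder(sentence) # "{name}"
p all_placeholders(sentence)  # ["{name}", "{thing}"]
p placeholder_names(sentence) # ["name", "thing"]
```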
One can do that using a positive lookbehind, (?<=\{), to require that the match be immediately preceded by a left brace, and a positive lookahead, to require that the match be immediately followed by a right brace.
str = "hello {name} is {thing}"
str.scan(/(?<=\{).*?(?=\})/)
#=> ["name", "thing"]
If there could be nested braces and only the strings within the inner braces were desired, .*? needs to be replaced with [^{]*?:
str = "hello {my {name} is {thing} from} the swamp"
str.scan(/(?<=\{)[^{]*?(?=\})/)
#=> ["name", "thing"]

Resources