ruby regex to match multiple occurrences of pattern - ruby

I am looking to build a ruby regex to match multiple occurrences of a pattern and return them in an array. The pattern is simply: [[.+]]. That is, two left brackets, one or more characters, followed by two right brackets.
This is what I have done:
str = "Some random text[[lead:first_name]] and more stuff [[client:last_name]]"
str.match(/\[\[(.+)\]\]/).captures
The regex above doesn't work because it returns this:
["lead:first_name]] and another [[client:last_name"]
When what I wanted was this:
["lead:first_name", "client:last_name"]
I thought if I used a noncapturing group that for sure it should solve the issue:
str.match(/(?:\[\[(.+)\]\])+/).captures
But the noncapturing group returns the same exact wrong output. Any idea on how I can resolve my issue?

The problem with your regex is that the .+ part is "greedy", meaning that if the regex matches both a smaller and larger part of the string, it will capture the larger part (more about greedy regexes).
In Ruby (and most regex syntaxes), you can qualify your + quantifier with a ? to make it non-greedy. So your regex would become /(?:\[\[(.+?)\]\])+/.
However, you'll notice this still doesn't work for what you want to do. The Ruby capture groups just don't work inside a repeating group. For your problem, you'll need to use scan:
"[[a]][[ab]][[abc]]".scan(/\[\[(.+?)\]\]/).flatten
=> ["a", "ab", "abc"]

Try this:
=> str.match(/\[\[(.*)\]\].*\[\[(.*)\]\]/).captures
=> ["lead:first_name", "client:last_name"]
With many occurrences:
=> str
=> "Some [[lead:first_name]] random text[[lead:first_name]] and more [[lead:first_name]] stuff [[client:last_name]]"
=> str.scan(/\[(\w+:\w+)\]/)
=> [["lead:first_name"], ["lead:first_name"], ["lead:first_name"], ["client:last_name"]]

Related

How to use gsubstitution with more letters

I've printed the code, wit ruby
string = "hahahah"
pring string.gsub("a","b")
How do I add more letter replacements into gsub?
string.gsub("a","b")("h","l") and string.gsub("a","b";"h","l")
didnt work...
*update I have tried this too but without any success .
letters = {
"a" => "l"
"b" => "n"
...
"z" => "f"
}
string = "hahahah"
print string.gsub(\/w\,letters)
You're overcomplicating. As with most method calls in Ruby, you can simply chain #gsub calls together, one after the other:
str = 'adfh'
print str.gsub("a","b").gsub("h","l") #=> 'bdfl'
What you're doing here is applying the second #gsub to the result of the first one.
Of course, that gets a bit long-winded if you do too many of them. So, when you find yourself stringing too many together, you'll want to look for a regex solution. Rubular is a great place to tinker with them.
The way to use your hash trick with #gsub and a regex expression is to provide a hash for all possible matches. This has the same result as the two #gsub calls:
print str.gsub(/[ah]/, {'a'=>'b', 'h'=>'l'}) #=> 'bdfl'
The regex matches either a or h (/[ah]/), and the hash is saying what to substitute for each of them.
All that said, str.tr('ah', 'bl') is the simplest way to solve your problem as specified, as some commenters have mentioned, so long as you are working with single letters. If you need to work with two or more characters per substitution, you'll need to use #gsub.

Capturing groups don't work as expected with Ruby scan method

I need to get an array of floats (both positive and negative) from the multiline string. E.g.: -45.124, 1124.325 etc
Here's what I do:
text.scan(/(\+|\-)?\d+(\.\d+)?/)
Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.
Any ideas why it's happening and how I can improve that?
See scan documentation:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
In this scenario, the capturing group is used to quantifiy a pattern sequence, thus all you need to do is convert the capturing group into a non-capturing one by replacing all unescaped ( with (?: (there is only one occurrence here):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)
See demo, output:
-45.124
1124.325
Well, if you need to also match floats like .04 you can use [+-]?\d*\.?\d+. See another demo
There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to a capturing group. In that case, you may either a) declare a variable to store all matches and collect them all inside a scan block, or b) enclose the whole pattern with another capturing group and map the results to get the first item from each match, c) you may use a gsub with just a regex as a single argument to return an Enumerator, with .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]
See this Ruby demo.
([+-]?\d+\.\d+)
assumes there is a leading digit before the decimal point
see demo at Rubular
If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.
Suppose you want to get the image urls in this string perhaps from a markdown text with html image tags:
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip
You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp
image_regex =
/https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b/
Now you don't need each sub-capture group, but just the the entire expression in your your .scan, you can just wrap the whole pattern inside a capture group and use it like this:
image_regex =
/(https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b)/
str.scan(image_regex).map(&:first)
=> ["https://user-images.githubusercontent.com/1949900/75255445-f59fb800-57af-11ea-9b7a-e075f55bf150.png",
"https://user-images.githubusercontent.com/1949900/75255473-02bca700-57b0-11ea-852a-58424698cfb0.png"]
How does this actually work?
Since you have 3 capture groups, .scan alone will return an Array of arrays with, one for each capture:
str.scan(image_regex)
=> [["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-e075f55bf150.png", "user-", "githubusercontent"],
["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-0714c8f76f68", nil, "zenhubusercontent"]]
Since we only want the 1st (outter) capture group, we can just call .map(&:first)

How to regex the strings in an url

http://something.com/bOhxBeD,SyhyTGi,TMDDSIB,U72gx2J,kQTIRy9,7VXgGDw,eSxIcK6,S5oNlnn,WBHHsLk,BdMGd2d,U9kNlsF,cHVyc7Y,D83kaJ5,cLWgdSO,iWtCIF3,ount8L6
I have tried to get the value: bOhxBeD, SyhyTGi and so on. This is what I come up with ( yes fairly simple ) /([a-zA-Z0-9]{7})/, it seems to work with PCRE:
([a-zA-Z0-9]{7})
Debuggex Demo
But when it comes to Ruby, I use it like this :
str.match(/([a-zA-Z0-9]{7})/)
#<MatchData "bOhxBeD" 1:"bOhxBeD">
it doesn't seem to work. Can anyone point out what's wrong with this regex ? Thanks
You need to add word boundary \b inorder to match an exact 7 alphanumeric characters.
\b[a-zA-Z0-9]{7}\b
DEMO
irb(main):006:0> "http://something.com/bOhxBeD,SyhyTGi,TMDDSIB,U72gx2J,kQTIRy9,7VXgGDw,eSxIcK6,S5oNlnn,WBHHsLk,BdMGd2d,U9kNlsF,cHVyc7Y,D83kaJ5,cLWgdSO,iWtCIF3,ount8L6".scan(/\b([a-zA-Z0-9]{7})\b/)
=> [["bOhxBeD"], ["SyhyTGi"], ["TMDDSIB"], ["U72gx2J"], ["kQTIRy9"], ["7VXgGDw"], ["eSxIcK6"], ["S5oNlnn"], ["WBHHsLk"], ["BdMGd2d"], ["U9kNlsF"], ["cHVyc7Y"], ["D83kaJ5"], ["cLWgdSO"], ["iWtCIF3"], ["ount8L6"]]
(?!.*?\/)[a-zA-Z0-9]{7}
Is should be this.Or else it will pick 7 letter words from link as well."somethi" will be in ans.But i guess that is not required.
match only picks up the first match.
You can try the global version of match which is scan.
You can use scan to search string not containing specific characters using [^...]:
str.scan(/[^\/\.\,]+/)[3..-1]
#=> ["bOhxBeD", "SyhyTGi", "TMDDSIB", "U72gx2J", "kQTIRy9", "7VXgGDw", "eSxIcK6", "S5oNlnn", "WBHHsLk", "BdMGd2d", "U9kNlsF", "cHVyc7Y", "D83kaJ5", "cLWgdSO", "iWtCIF3", "ount8L6"]
Update:
If you know that the strings between the comma are always 7 characters, you can use this instead:
str.scan(/[^\/\.\,]{7}/)[1..-1]
it happens because your regexp match just one element which contain 7 chars, nothing more,
as simple solution could be:
str.match(/\/(.*)\z/)[1].split(',')
You could use String#[] and String#split:
str[/.*\/(.*)/,1].split(',')
#=> ["bOhxBeD", "SyhyTGi", "TMDDSIB", "U72gx2J", "kQTIRy9", "7VXgGDw",
# "eSxIcK6", "S5oNlnn", "WBHHsLk", "BdMGd2d", "U9kNlsF", "cHVyc7Y",
# "D83kaJ5", "cLWgdSO", "iWtCIF3", "ount8L6"]
.*\/ in the regex, "greedy" as it is, will consume characters up to and including the last forward slash in the string. Capture group #1 (.*) sucks up the remainder of the string and, due to the presence of ,1, returns it. split(',') then breaks up the string to give you the desired array.
Another way:
str[str[/.*\//].size..-1].split(',')

Regex to leave desired string remaining and others removed

In Ruby, what regex will strip out all but a desired string if present in the containing string? I know about /[^abc]/ for characters, but what about strings?
Say I have the string "group=4&type_ids[]=2&type_ids[]=7&saved=1" and want to retain the pattern group=\d, if it is present in the string using only a regex?
Currently, I am splitting on & and then doing a select with matching condition =~ /group=\d/ on the resulting enumerable collection. It works fine, but I'd like to know the regex to do this more directly.
Simply:
part = str[/group=\d+/]
If you want only the numbers, then:
group_str = str[/group=(\d+)/,1]
If you want only the numbers as an integer, then:
group_num = str[/group=(\d+)/,1].to_i
Warning: String#[] will return nil if no match occurs, and blindly calling nil.to_i always returns 0.
You can try:
$str =~ s/.*(group=\d+).*/\1/;
Typically I wouldn't really worry too much about a complex regex. Simply break the string down into smaller parts and it becomes easier:
asdf = "group=4&type_ids[]=2&type_ids[]=7&saved=1"
asdf.split('&').select{ |q| q['group'] } # => ["group=4"]
Otherwise, you can use regex a bunch of different ways. Here's two ways I tend to use:
asdf.scan(/group=\d+/) # => ["group=4"]
asdf[/(group=\d+)/, 1] # => "group=4"
Try:
str.match(/group=\d+/)[0]

How can I extract a variable number of sub-matches from a Ruby regex?

I have some strings that I would like to pattern match and then extract out the matches as variables $1, $2, etc.
The pattern matching code I have is
a = /^([\+|\-]?[1-9]?)([C|P])(?:([\+|\-][1-9]?)([C|P]))*$/i.match(field)
puts result = #{a.to_a.inspect}
With the above I am able to easily match the following sample strings:
"C", "+2C", "2c-P", "2C-3P", "P+C"
And I have confirmed all of these work on the Rubular website.
However, when I try to match "+2P-c-3p", it matches however, the MatchData "array-like object" looks like this:
result = ["+2P-C-3P", "+2", "P", "-3", "P"]
The problem is that I am unable to extract into the array, the middle pattern "-C".
What I would expect to see is:
result = ["+2P-C-3P", "+2", "P", "-", "C", "-3", "P"]
It seems to extract only the end part "-3P" as "-3" and "P"
Does anyone know how I can modify my pattern to capture the middle matches ?
So as an other example, +3c+2p-c-4p, I would expect should create:
["+3c+2p-c-4p", "+3", "C", "+2", "P", "-", "C", "-4", "P"]
but what I get is
["+3c+2p-c-4p", "+3", "C", "-4", "P"]
which completely misses the middle part.
You have a profound (but common) misunderstanding how character classes work. This:
[C|P]
is wrong. Unless you want to match pipe | characters. There is no alternation in character classes - they are not like groups. This would be correct:
[CP]
Also, there are no meta-characters in a character class, so you only need to escape very few characters (namely, the closing square bracket ] and the dash -, unless you put it at the end of the group). So your regex reduces to:
^([+-]?\d?)([CP])(?:([+-]?\d?)([CP]))*$
Your second misunderstanding is that group count is dynamic - that you somehow have more groups in the result because more matches occurred in the string. This is not the case.
You have exactly as many groups in your result as you have parentheses pairs in your regex (less the number of non-capturing groups of course). In this case, that number is 4. No more, no less.
If a group matches multiple times, only the contents of the last match occurrence will be retained. There is no way (in Ruby) to get the contents of previous match occurrences for that group.
As an alternative, you could regex-split the string into its meaningful parts and then parse them in a loop to extract all info.
This is what I managed to do :
([+-]?\d?)(C|P)(?=(?:[+-]?\d?[CP])*$)
This way you capture multiple elements.
The only problem is the validity of the string. As ruby doesn't have look-behind I can't check the start of the string, so zerhyju+2P-C-3P is valid (but will only capture +2P-C-3P) whereas +2P-C-3Pzertyuio isn't valid.
If you want to both capture and check if your string is valid, the best way (IMO) is to use two regexes, one to check the value ^(?:[+-]?\d?[CP])*$ and a second one to capture ([+-]?\d?)(C|P) (You could also use ([CP]) for the last part).

Resources