How to match the last occurrence of a pattern? - ruby

Given that I have the following string:
String 1 | string 2 | string 3
I want my regex to match the value after the last pipe and space, which in this case is "string 3".
Right now I am doing using this: /[^|]+$/i but it also return the space character after the pipe.
https://regex101.com/r/stnW0D/1

Without regex:
"String 1 | string 2 | string 3".split(" | ").last # => "string 3"

'String 1 | string 2 | string 3'[/(?<=\|\s)(\w+\s\d+\z)/]
# "string 3"
Where (escaped):
\| # a pipe
\s # a whitespace
\w+ # one or more of any word character
\s # a whitespace
\d+ # one or more digits
\z # end of string
(...) # capture everything enclosed
(?<=...) # a positive lookbehind
Notice in this case the regex is already getting the last occurrence of the pattern in the string, by attaching it to the end of the string (\z). In other case you could use [\|\s]? instead of (\|\s) to match the string followed by a whitespace and a number and from there, access the last element in the returned array:
'String 1 | string 2 | string 3'.scan(/[\|\s]?(\w+\s\d+)/).last
# ["string 3"]

str = "string 1 | string 2 | string 3"
str[/[^ |][^|]*\z/]
#=> "string 3"

Related

Word after a particular word in a string

I have string say e.g. ab_abc_bbb_ccc_ssss_pppp, I want the word after ccc i.e. ssss from the string, how to achieve the same using unix command
you mean something like this?
echo "ab_abc_bbb_ccc_ssss_pppp" | sed 's/.*ccc_\([^_]*\).*/\1/'
explanation
s/ # substitute
.*ccc_ # find search pattern
\([^_]*\) # save all chars without '_' into arg1 (\1)
_.*/ # ignore trailing chars
\1/ # print \1
output
ssss

How do I write a regex that eliminates the space between a number and a colon?

I want to replace a space between one or two numbers and a colon followed by a space, a number, or the end of the line. If I have a string like,
line = " 0 : 28 : 37.02"
the result should be:
" 0: 28: 37.02"
I tried as below:
line.gsub!(/(\A|[ \u00A0|\r|\n|\v|\f])(\d?\d)[ \u00A0|\r|\n|\v|\f]:(\d|[ \u00A0|\r|\n|\v|\f]|\z)/, '\2:\3')
# => " 0: 28 : 37.02"
It seems to match the first ":", but the second ":" is not matched. I can't figure out why.
The problem
I'll define your regex with comments (in free-spacing mode) to show what it is doing.
r =
/
( # begin capture group 1
\A # match beginning of string (or does it?)
| # or
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
) # end capture group 1
(\d?\d) # match one or two digits in capture group 2
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
: # match ":"
( # begin capture group 3
\d # match a digit
| # or
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
| # or
\z # match the end of the string
) # end capture group 3
/x # free-spacing regex definition mode
Note that '|' is not a special character ("or") within a character class. It's treated as an ordinary character. (Even if '|' were treated as "or" within a character class, that would serve no purpose because character classes are used to force any one character within it to be matched.)
Suppose
line = " 0 : 28 : 37.02"
Then
line.gsub(r, '\2:\3')
#=> " 0: 28 : 37.02"
$1 #=> " "
$2 #=> "0"
$3 #=> " "
In capture group 1 the beginning of the line (\A) is not matched because it is not a character and only characters are not matched (though I don't know why that does not raise an exception). The special character for "or" ('|') causes the regex engine to attempt to match one character of the string " \u00A0|\r\n\v\f". It therefore would match one of the three spaces at the beginning of the string line.
Next capture group 2 captures "0". For it to do that, capture group 1 must have captured the space at index 2 of line. Then one more space and a colon are matched, and lastly, capture group 3 takes the space after the colon.
The substring ' 0 : ' is therefore replaced with '\2:\3' #=> '0: ', so gsub returns " 0: 28 : 37.02". Notice that one space before '0' was removed (but should have been retained).
A solution
Here's how you can remove the last of one or more Unicode whitespace characters that are preceded by one or two digits (and not more) and are followed by a colon at the end of the string or a colon followed by a whitespace or digit. (Whew!)
def trim(str)
str.gsub(/\d+[[:space:]]+:(?![^[:space:]\d])/) do |s|
s[/\d+/].size > 2 ? s : s[0,s.size-2] << ':'
end
end
The regular expression reads, "match one or more digits followed by one or more whitespace characters, followed by a colon (all these characters are matched), not followed (negative lookahead) by a character other than a unicode whitespace or digit". If there is a match, we check to see how many digits there are at the beginning. If there are more than two the match is returned (no change), else the whitespace character before the colon is removed from the match and the modified match is returned.
trim " 0 : 28 : 37.02"
#=> " 0: 28: 37.02" xxx
trim " 0\v: 28 :37.02"
#=> " 0: 28:37.02"
trim " 0\u00A0: 28\n:37.02"
#=> " 0: 28:37.02"
trim " 123 : 28 : 37.02"
#=> " 123 : 28: 37.02"
trim " A12 : 28 :37.02"
#=> " A12: 28:37.02"
trim " 0 : 28 :"
#=> " 0: 28:"
trim " 0 : 28 :A"
#=> " 0: 28 :A"
If, as in the example, the only characters in the string are digits, whitespaces and colons, the lookbehind is not needed.
You can use Ruby's \p{} construct, \p{Space}, in place of the POSIX expression [[:space:]]. Both match a class of Unicode whitespace characters, including those shown in the examples.
Excluding the third digit can be done with a negative lookback, but since the other one or two digits are of variable length, you cannot use positive lookback for that part.
line.gsub(/(?<!\d)(\d{1,2}) (?=:[ \d\$])/, '\1')
# => " 0: 28: 37.02"
" 0 : 28 : 37.02".gsub!(/(\d)(\s)(:)/,'\1\3')
=> " 0: 28: 37.02"

How do I write a regex that captures the first non-numeric part of string that also doesn't include 3 or more spaces?

I'm using Ruby 2.4. I want to extract from a string the first consecutive occurrence of non-numeric characters that do not include at least three or more spaces. For example, in this string
str = "123 aa bb cc 33 dd"
The first such occurrence is " aa bb ". I thought the below expression would help me
data.split(/[[:space:]][[:space:]][[:space:]]+/).first[/\p{L}\D+\p{L}\p{L}/i]
but if the string is "123 456 aaa", it fails to return " aaa", which I would want it to.
r = /
(?: # begin non-capture group
[ ]{,2} # match 0, 1 or 2 spaces
[^[ ]\d]+ # match 1+ characters that are neither spaces nor digits
)+ # end non-capture group and perform 1+ times
[ ]{,2} # match 0, 1 or 2 spaces
/x # free-spacing regex definition mode
str = "123 aa bb cc 33 dd"
str[r] #=> " aa bb "
Note that [ ] could be replaced by a space if free-spacing regex definition mode is not used:
r = /(?: {,2}[^ \d]+)+ {,2}/
Remove all digits + spaces from the start of a string. Then split with 3 or more whitespaces and grab the first item.
def parse_it(s)
s[/\A(?:[\d[:space:]]*\d)?(\D+)/, 1].split(/[[:space:]]{3,}/).first
end
puts parse_it("123 aa bb cc 33 dd")
# => aa bb
puts parse_it("123 456 aaa")
# => aaa
See the Ruby demo
The first regex \A(?:[\d[:space:]]*\d)?(\D+) matches:
\A - start of a string
(?:[\d[:space:]]*\d)? - an optional sequence of:
[\d[:space:]]* - 0+ digits or whitespaces
\d - a digit
(\D+) -Group 1 capturing 1 or more non-digits
The splitting regex is [[:space:]]{3,}, it matches 3 or more whitespaces.
It looks like this'd do it:
regex = /(?: {1,2}[[:alpha:]]{2,})+/
"123 aa bb cc 33 dd"[regex] # => " aa bb"
"123 456 aaa"[regex] # => " aaa"
(?: ... ) is a non-capturing group.
{1,2} means "find at least one, and at most two".
[[:alpha:]] is a POSIX definition for alphabet characters. It's more comprehensive than [a-z].
You should be able to figure out the rest, which is all documented in the Regexp documentation and String's [] documentation.
Will this work?
str.match(/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/)[0]
or apparently
str[/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/]
or using Cary's nice space match,
str[/ {,2}(?:[^ 0-9]+ {,2})+/]

Where did the character go?

I matched a string against a regex:
s = "`` `foo`"
r = /(?<backticks>`+)(?<inline>.+)\g<backticks>/
And I got:
s =~ r
$& # => "`` `foo`"
$~[:backticks] # => "`"
$~[:inline] # => " `foo"
Why is $~[:inline] not "` `foo"? Since $& is s, I expect:
$~[:backticks] + $~[:inline] + $~[:backticks]
to be s, but it is not, one backtick is gone. Where did the backtick go?
It is actually expected. Look:
(?<backticks>`+) - matches 1+ backticks and stores them in the named capture group "backticks" (there are two backticks). Then...
(?<inline>.+) - 1+ characters other than a newline are matched into the "inline" named capture group. It grabs all the string and backtracks to yield characters to the recursed subpattern that is actually the "backticks" capture group. So,...
\g<backticks> - finds 1 backtick that is at the end of the string. It satisfies the condition to match 1+ backticks. The named capture "backtick" buffer is re-written here.
The matching works like this:
"`` `foo`"
||1
| 2 |
|3
And then 1 becomes 3, and since 1 and 3 are the same group, you see one backtick.

what would the regular expression to extract the 3 from be?

I basically need to get the bit after the last pipe
"3083505|07733366638|3"
What would the regular expression for this be?
You can do this without regex. Here:
"3083505|07733366638|3".split("|").last
# => "3"
With regex: (assuming its always going to be integer values)
"3083505|07733366638|3".scan(/\|(\d+)$/)[0][0] # or use \w+ if you want to extract any word after `|`
# => "3"
Try this regex :
.*\|(.*)
It returns whatever comes after LAST | .
You could do that most easily by using String#rindex:
line = "3083505|07733366638|37"
line[line.rindex('|')+1..-1]
#=> "37"
If you insist on using a regex:
r = /
.* # match any number of any character (greedily!)
\| # match pipe
(.+) # match one or more characters in capture group 1
/x # extended mode
line[r,1]
#=> "37"
Alternatively:
r = /
.* # match any number of any character (greedily!)
\| # match pipe
\K # forget everything matched so far
.+ # match one or more characters
/x # extended mode
line[r]
#=> "37"
or, as suggested by #engineersmnky in a comment on #shivam's answer:
r = /
(?<=\|) # match a pipe in a positive lookbehind
\d+ # match any number of digits
\z # match end of string
/x # extended mode
line[r]
#=> "37"
I would use split and last, but you could do
last_field = line.sub(/.+\|/, "")
That remove all chars up to and including the last pipe.

Resources