How do I write a regex that captures the first non-numeric part of a string that also doesn't include 3 or more spaces? - ruby

I'm using Ruby 2.4. I want to extract from a string the first run of consecutive non-numeric characters that does not contain three or more consecutive spaces. For example, in this string
str = "123 aa bb cc 33 dd"
The first such occurrence is " aa bb ". I thought the below expression would help me
data.split(/[[:space:]][[:space:]][[:space:]]+/).first[/\p{L}\D+\p{L}\p{L}/i]
but if the string is "123 456 aaa", it fails to return " aaa", which I would want it to.

r = /
(?: # begin non-capture group
[ ]{,2} # match 0, 1 or 2 spaces
[^[ ]\d]+ # match 1+ characters that are neither spaces nor digits
)+ # end non-capture group and perform 1+ times
[ ]{,2} # match 0, 1 or 2 spaces
/x # free-spacing regex definition mode
str = "123 aa bb cc 33 dd"
str[r] #=> " aa bb "
Note that [ ] could be replaced by a space if free-spacing regex definition mode is not used:
r = /(?: {,2}[^ \d]+)+ {,2}/
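As a quick check, the compact form can be run against both strings from the question. This is only a sketch: it assumes the first sample has three spaces between "bb" and "cc" (the spacing that makes " aa bb" the intended answer), and the variable names are illustrative.

```ruby
# Compact form of the regex above (no free-spacing mode).
r = /(?: {,2}[^ \d]+)+ {,2}/

# Assumption: three spaces separate "bb" and "cc" in the first sample.
first  = "123 aa bb   cc 33 dd"[r]
second = "123 456 aaa"[r]

p first   #=> " aa bb  "  (up to two of the three spaces are kept)
p second  #=> " aaa"
```

The trailing {,2} is greedy, so the match can end in up to two spaces; call strip on the result if they are unwanted.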

Remove all digits + spaces from the start of a string. Then split with 3 or more whitespaces and grab the first item.
def parse_it(s)
s[/\A(?:[\d[:space:]]*\d)?(\D+)/, 1].split(/[[:space:]]{3,}/).first
end
puts parse_it("123 aa bb cc 33 dd")
# => aa bb
puts parse_it("123 456 aaa")
# => aaa
The first regex \A(?:[\d[:space:]]*\d)?(\D+) matches:
\A - start of a string
(?:[\d[:space:]]*\d)? - an optional sequence of:
[\d[:space:]]* - 0+ digits or whitespaces
\d - a digit
(\D+) - Group 1, capturing 1 or more non-digits
The splitting regex is [[:space:]]{3,}; it matches 3 or more whitespace characters.
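To see what each of the two steps contributes, they can be run separately. A minimal sketch, assuming three spaces between "bb" and "cc" in the sample; non_digits and pieces are illustrative names, not part of the answer.

```ruby
# Assumption: three spaces between "bb" and "cc".
s = "123 aa bb   cc 33 dd"

# Step 1: skip leading digits/whitespace, capture the first run of non-digits.
non_digits = s[/\A(?:[\d[:space:]]*\d)?(\D+)/, 1]

# Step 2: split that run wherever 3 or more whitespace characters occur.
pieces = non_digits.split(/[[:space:]]{3,}/)

p non_digits    #=> " aa bb   cc "
p pieces        #=> [" aa bb", "cc "]
p pieces.first  #=> " aa bb"
```

Note that the leading space survives in the result; puts prints it, even if it is easy to overlook in the output.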

It looks like this'd do it:
regex = /(?: {1,2}[[:alpha:]]{2,})+/
"123 aa bb cc 33 dd"[regex] # => " aa bb"
"123 456 aaa"[regex] # => " aaa"
(?: ... ) is a non-capturing group.
{1,2} means "find at least one, and at most two".
[[:alpha:]] is a POSIX definition for alphabet characters. It's more comprehensive than [a-z].
You should be able to figure out the rest, which is all documented in the Regexp documentation and String's [] documentation.
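The claim that [[:alpha:]] is more comprehensive than [a-z] can be checked with a word containing a non-ASCII letter. The sample word here is an assumption chosen for illustration.

```ruby
# "naïve", written with an escape so the snippet does not depend on file encoding.
word = "na\u00EFve"

posix_match = word[/[[:alpha:]]+/]  # Unicode-aware on UTF-8 strings
ascii_match = word[/[a-z]+/]        # ASCII lowercase letters only

p posix_match == word  #=> true
p ascii_match          #=> "na"
```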

Will this work?
str.match(/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/)[0]
or apparently
str[/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/]
or using Cary's nice space match,
str[/ {,2}(?:[^ 0-9]+ {,2})+/]

Related

How to match the last occurrence of a pattern?

Given that I have the following string:
String 1 | string 2 | string 3
I want my regex to match the value after the last pipe and space, which in this case is "string 3".
Right now I am using /[^|]+$/i, but it also returns the space character after the pipe.
https://regex101.com/r/stnW0D/1
Without regex:
"String 1 | string 2 | string 3".split(" | ").last # => "string 3"
'String 1 | string 2 | string 3'[/(?<=\|\s)(\w+\s\d+\z)/]
# "string 3"
Where (escaped):
\| # a pipe
\s # a whitespace
\w+ # one or more of any word character
\s # a whitespace
\d+ # one or more digits
\z # end of string
(...) # capture everything enclosed
(?<=...) # a positive lookbehind
Notice that in this case the regex already gets the last occurrence of the pattern, because it is anchored to the end of the string (\z). Otherwise, you could use [\|\s]? instead of the lookbehind (?<=\|\s), scan for every word followed by a whitespace and a number, and take the last element of the returned array:
'String 1 | string 2 | string 3'.scan(/[\|\s]?(\w+\s\d+)/).last
# ["string 3"]
str = "string 1 | string 2 | string 3"
str[/[^ |][^|]*\z/]
#=> "string 3"
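For comparison, here is the asker's pattern next to the last answer's pattern on the sample string. The lstrip variant is a swapped-in alternative, not something proposed in the answers; the variable names are illustrative.

```ruby
str = "String 1 | string 2 | string 3"

with_space    = str[/[^|]+$/]         # the asker's pattern keeps the leading space
no_lead_space = str[/[^ |][^|]*\z/]   # first character may be neither space nor pipe
stripped      = str[/[^|]+$/].lstrip  # or strip the leading space afterwards

p with_space     #=> " string 3"
p no_lead_space  #=> "string 3"
p stripped       #=> "string 3"
```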

How do I write a regex that eliminates the space between a number and a colon?

I want to replace a space between one or two numbers and a colon followed by a space, a number, or the end of the line. If I have a string like,
line = " 0 : 28 : 37.02"
the result should be:
" 0: 28: 37.02"
I tried as below:
line.gsub!(/(\A|[ \u00A0|\r|\n|\v|\f])(\d?\d)[ \u00A0|\r|\n|\v|\f]:(\d|[ \u00A0|\r|\n|\v|\f]|\z)/, '\2:\3')
# => " 0: 28 : 37.02"
It seems to match the first ":", but the second ":" is not matched. I can't figure out why.
The problem
I'll define your regex with comments (in free-spacing mode) to show what it is doing.
r =
/
( # begin capture group 1
\A # match beginning of string (or does it?)
| # or
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
) # end capture group 1
(\d?\d) # match one or two digits in capture group 2
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
: # match ":"
( # begin capture group 3
\d # match a digit
| # or
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
| # or
\z # match the end of the string
) # end capture group 3
/x # free-spacing regex definition mode
Note that '|' is not a special character ("or") within a character class. It's treated as an ordinary character. (Even if '|' were treated as "or" within a character class, that would serve no purpose because character classes are used to force any one character within it to be matched.)
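That '|' is an ordinary character inside a character class is easy to confirm. A small sketch with two throwaway patterns:

```ruby
with_pipe    = /[a|b]/  # matches "a", "|" or "b"
without_pipe = /[ab]/   # matches "a" or "b"

pipe_in_class     = "|".match?(with_pipe)
pipe_not_in_class = "|".match?(without_pipe)

p pipe_in_class      #=> true
p pipe_not_in_class  #=> false
```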
Suppose
line = " 0 : 28 : 37.02"
Then
line.gsub(r, '\2:\3')
#=> " 0: 28 : 37.02"
$1 #=> " "
$2 #=> "0"
$3 #=> " "
In capture group 1 the beginning-of-string anchor (\A) does match at index 0, but it is zero-width, and the next token, (\d?\d), then fails because the string begins with a space. The regex engine therefore falls back to the other side of the alternation (the '|' outside a character class) and attempts to match one character of the string " \u00A0|\r\n\v\f". That is how it comes to match one of the three spaces at the beginning of the string line.
Next capture group 2 captures "0". For it to do that, capture group 1 must have captured the space at index 2 of line. Then one more space and a colon are matched, and lastly, capture group 3 takes the space after the colon.
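The substring ' 0 : ' is therefore replaced with '\2:\3' #=> '0: ', so gsub returns " 0: 28 : 37.02". Notice that one space before '0' was removed (but should have been retained). The second colon is missed because capture group 3 has already consumed the space after the first colon: the search resumes at "28", which is then preceded neither by the start of the string nor by an unconsumed whitespace character, so capture group 1 cannot match there.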
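The walk-through can be replayed with scan, which shows that the original regex finds only one match. The sketch assumes the line starts with three spaces, as the discussion implies. r2 is a hypothetical variant, not the answer's solution: it checks what follows the colon with a lookahead so that the space is not consumed.

```ruby
# The asker's regex, unchanged.
r = /(\A|[ \u00A0|\r|\n|\v|\f])(\d?\d)[ \u00A0|\r|\n|\v|\f]:(\d|[ \u00A0|\r|\n|\v|\f]|\z)/

# Assumption: three leading spaces.
line = "   0 : 28 : 37.02"

matches  = line.scan(r)
replaced = line.gsub(r, '\2:\3')
p matches   #=> [[" ", "0", " "]]  (a single match)
p replaced  #=> "  0: 28 : 37.02"

# Hypothetical variant: a lookahead leaves the space after the colon
# available as the start of the next match; \1 keeps the leading space.
r2 = /(\A|[[:space:]])(\d?\d)[[:space:]]:(?=\d|[[:space:]]|\z)/
fixed = line.gsub(r2, '\1\2:')
p fixed     #=> "   0: 28: 37.02"
```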
A solution
Here's how you can remove the last of one or more Unicode whitespace characters that are preceded by one or two digits (and not more) and are followed by a colon at the end of the string or a colon followed by a whitespace or digit. (Whew!)
def trim(str)
str.gsub(/\d+[[:space:]]+:(?![^[:space:]\d])/) do |s|
s[/\d+/].size > 2 ? s : s[0,s.size-2] << ':'
end
end
The regular expression reads, "match one or more digits followed by one or more whitespace characters, followed by a colon (all these characters are matched), not followed (negative lookahead) by a character other than a unicode whitespace or digit". If there is a match, we check to see how many digits there are at the beginning. If there are more than two the match is returned (no change), else the whitespace character before the colon is removed from the match and the modified match is returned.
trim " 0 : 28 : 37.02"
#=> " 0: 28: 37.02"
trim " 0\v: 28 :37.02"
#=> " 0: 28:37.02"
trim " 0\u00A0: 28\n:37.02"
#=> " 0: 28:37.02"
trim " 123 : 28 : 37.02"
#=> " 123 : 28: 37.02"
trim " A12 : 28 :37.02"
#=> " A12: 28:37.02"
trim " 0 : 28 :"
#=> " 0: 28:"
trim " 0 : 28 :A"
#=> " 0: 28 :A"
If, as in the example, the only characters in the string are digits, whitespaces and colons, the negative lookahead is not needed.
You can use Ruby's \p{} construct, \p{Space}, in place of the POSIX expression [[:space:]]. Both match a class of Unicode whitespace characters, including those shown in the examples.
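The difference between these Unicode-aware classes and plain \s shows up with a no-break space. A short sketch; nbsp is just an illustrative name.

```ruby
nbsp = "\u00A0"  # no-break space, as used in the examples above

posix_hit    = nbsp.match?(/[[:space:]]/)
property_hit = nbsp.match?(/\p{Space}/)
ascii_hit    = nbsp.match?(/\s/)  # \s covers ASCII whitespace only

p posix_hit     #=> true
p property_hit  #=> true
p ascii_hit     #=> false
```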
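Excluding a third digit can be done with a negative lookbehind, but since the one or two digits you want to keep are of variable length, you cannot use a lookbehind for that part, so they are captured instead. The end of the line has to be an alternative outside the character class, because inside one $ is just a literal character.
line.gsub(/(?<!\d)(\d{1,2}) (?=:(?:[ \d]|$))/, '\1')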
# => " 0: 28: 37.02"
" 0 : 28 : 37.02".gsub!(/(\d)(\s)(:)/,'\1\3')
# => " 0: 28: 37.02"

Remove duplicate substrings from a string

str = "hi ram hi shyam hi jhon"
I want something like:
"ram hi shyam hi jhon"
"ram shyam hi jhon"
I assume you want to remove duplicate occurrences of all words, not just "hi". Here are two ways of doing that.
1 Use String#split, Array#reverse and Array#uniq
str = "hi shyam ram hi shyam hi jhon"
str.split.reverse.uniq.reverse.join(' ')
#=> "ram shyam hi jhon"
The doc for uniq states: "self is traversed in order, and the first occurrence is kept."
2 Use a regular expression
r = /
\b # match a word break
(\w+) # match a word in capture group 1
\s # match a trailing space
(?= # begin a positive lookahead
.* # match any number of characters
\s # match a space
\1 # match the contents of capture group 1
\b # match a word break
) # end the positive lookahead
/x # free-spacing regex definition mode
str.gsub(r, '')
#=> "ram shyam hi jhon"
To remove the extra spaces change \s to \s+ in the third line of the regex definition.
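The reverse/uniq/reverse step in the first approach exists because uniq keeps the first occurrence of each word. A sketch contrasting the two orders, using the same string as above:

```ruby
words = "hi shyam ram hi shyam hi jhon".split

keep_first = words.uniq                  # first occurrence of each word survives
keep_last  = words.reverse.uniq.reverse  # last occurrence of each word survives

p keep_first.join(' ')  #=> "hi shyam ram jhon"
p keep_last.join(' ')   #=> "ram shyam hi jhon"
```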
str = "hi ram hi shyam hi jhon"
To remove one occurrence:
str.sub('hi', '').strip.squeeze(' ')
#=> "ram hi shyam hi jhon"
To remove n occurrences:
n.times { str.replace(str.sub('hi', '').strip.squeeze(' ')) }
You are looking for sub!:
str = "hi ram hi shyam hi jhon"
str.sub!("hi ", "")
#=> "ram hi shyam hi jhon"
str.sub!("hi ", "")
#=> "ram shyam hi jhon"
str.sub!("hi ", "")
#=> "ram shyam jhon"
In case you do not want to modify your original string, which is not what the example suggests, you might want to use sub instead, with an extra variable.
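A minimal sketch of the non-mutating version; original, once and twice are illustrative names.

```ruby
original = "hi ram hi shyam hi jhon"

once  = original.sub("hi ", "")  # sub returns a new string
twice = once.sub("hi ", "")

p original  #=> "hi ram hi shyam hi jhon" (unchanged)
p once      #=> "ram hi shyam hi jhon"
p twice     #=> "ram shyam hi jhon"

# sub! returns nil when nothing was replaced, so avoid chaining on it.
no_change = "ram".dup.sub!("hi ", "")
p no_change  #=> nil
```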

Where did the character go?

I matched a string against a regex:
s = "`` `foo`"
r = /(?<backticks>`+)(?<inline>.+)\g<backticks>/
And I got:
s =~ r
$& # => "`` `foo`"
$~[:backticks] # => "`"
$~[:inline] # => " `foo"
Why is $~[:inline] not "` `foo"? Since $& is s, I expect:
$~[:backticks] + $~[:inline] + $~[:backticks]
to be s, but it is not, one backtick is gone. Where did the backtick go?
It is actually expected. Look:
(?<backticks>`+) - matches 1+ backticks and stores them in the named capture group "backticks" (there are two backticks). Then...
(?<inline>.+) - 1+ characters other than a newline are matched into the "inline" named capture group. It grabs all the string and backtracks to yield characters to the recursed subpattern that is actually the "backticks" capture group. So,...
\g<backticks> - finds 1 backtick that is at the end of the string. It satisfies the condition to match 1+ backticks. The named capture "backticks" buffer is overwritten here.
The matching works like this:
"`` `foo`"
 ||1
   | 2 |
        |3
And then 1 becomes 3, and since 1 and 3 are the same group, you see one backtick.
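The overwrite can be observed through the match positions. The second pattern below uses a backreference (\k<backticks>) instead of the subexpression call; that is a different construct, shown only for contrast and not proposed in the answer.

```ruby
s = "`` `foo`"

# \g<backticks> re-runs the group's pattern and overwrites its capture.
call = s.match(/(?<backticks>`+)(?<inline>.+)\g<backticks>/)
p call[:backticks]        #=> "`"
p call.begin(:backticks)  #=> 7 (the final backtick, not index 0)
p call[:inline]           #=> " `foo"

# \k<backticks> must repeat the text that the group already captured.
backref = s.match(/(?<backticks>`+)(?<inline>.+)\k<backticks>/)
rebuilt = backref[:backticks] + backref[:inline] + backref[:backticks]
p rebuilt == s            #=> true
```

With the backreference the three pieces add up to the whole string, because the opening group has to backtrack to a single backtick.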

What would the regular expression to extract the 3 be?

I basically need to get the bit after the last pipe
"3083505|07733366638|3"
What would the regular expression for this be?
You can do this without regex. Here:
"3083505|07733366638|3".split("|").last
# => "3"
With regex: (assuming it's always going to be integer values)
"3083505|07733366638|3".scan(/\|(\d+)$/)[0][0] # or use \w+ if you want to extract any word after `|`
# => "3"
Try this regex:
.*\|(.*)
Capture group 1 holds whatever comes after the last |.
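In Ruby, one way to apply that pattern is String#[] with a capture index. A short sketch; last_field is an illustrative name.

```ruby
line = "3083505|07733366638|3"

# The greedy .* runs to the last pipe; index 1 selects the capture group.
last_field = line[/.*\|(.*)/, 1]

p last_field  #=> "3"
```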
You could do that most easily by using String#rindex:
line = "3083505|07733366638|37"
line[line.rindex('|')+1..-1]
#=> "37"
If you insist on using a regex:
r = /
.* # match any number of any character (greedily!)
\| # match pipe
(.+) # match one or more characters in capture group 1
/x # extended mode
line[r,1]
#=> "37"
Alternatively:
r = /
.* # match any number of any character (greedily!)
\| # match pipe
\K # forget everything matched so far
.+ # match one or more characters
/x # extended mode
line[r]
#=> "37"
or, as suggested by @engineersmnky in a comment on @shivam's answer:
r = /
(?<=\|) # match a pipe in a positive lookbehind
\d+ # match one or more digits
\z # match end of string
/x # extended mode
line[r]
#=> "37"
I would use split and last, but you could do
last_field = line.sub(/.+\|/, "")
That removes all characters up to and including the last pipe.
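The approaches above behave differently when the input contains no pipe at all, which may matter when choosing between them. A sketch with a hypothetical pipe-free input:

```ruby
no_pipe = "3083505"  # hypothetical input without any pipe

via_split  = no_pipe.split("|").last   # whole string comes back
via_sub    = no_pipe.sub(/.+\|/, "")   # nothing to remove, whole string comes back
via_regex  = no_pipe[/.*\|(.+)/, 1]    # no match
via_rindex = no_pipe.rindex("|")       # nil, so rindex('|') + 1 would raise NoMethodError

p via_split   #=> "3083505"
p via_sub     #=> "3083505"
p via_regex   #=> nil
p via_rindex  #=> nil
```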

Resources