I need a regex to capture just the number from a string - ruby

I do not have access to the code; this is via an interface that only allows me to edit the regex that parses user responses. I need to extract the weight from what users text, where they text things like:
wt 172.5
172.5 lbs
180
wt. 173.22
172,5
I need to capture the weight as a float field, but I want to restrict it to at most 1 decimal place. I tried using /(?<val>[\d+((\.|,)\d\d?)?]/ but it is only saving the first digit "1" in the field

Sometimes what seems most simple is not. I suggest using this regex:
r = /(?<=\A|\s)\d+(?:[.,]\d)?(?=\d|\s|\z)/
We can alternatively define the regex using extended or free-spacing mode (by adding the modifier x after the final /), which allows us to include documentation:
r = /
(?<=\A|\s) # match beginning of string or space in a positive lookbehind
\d+ # match one or more digits
(?:[.,]\d)? # optionally (? after non-capture group) match a . or , then a digit
(?=\d|\s|\z) # match a digit, space or the end of the string in a positive lookahead
/x
"wt 172.5"[r] #=> "172.5"
"172.5 lbs"[r] #=> "172.5"
"180"[r] #=> "180"
"wt. 173.22"[r] #=> "173.2"
"172,5"[r] #=> "172,5"
"A1 143.66"[r] #=> "143.6"
"A1 1.3.4 43.6"[r] #=> "43.6"
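If the captured text is later converted to a number, the comma form has to be normalised first. Here is a minimal sketch of that step, assuming the surrounding application does something along these lines (the parse_weight helper is a hypothetical name, not part of the interface described in the question):

```ruby
# Hypothetical helper: pull the weight out of a message and return a Float.
def parse_weight(text)
  weight_re = /(?<=\A|\s)\d+(?:[.,]\d)?(?=\d|\s|\z)/
  match = text[weight_re]          # nil when no weight is found
  return nil if match.nil?
  match.tr(',', '.').to_f          # "172,5" -> "172.5" -> 172.5
end

parse_weight("wt. 173.22") #=> 173.2
parse_weight("172,5")      #=> 172.5
parse_weight("no weight")  #=> nil
```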

\d+(?:[,.]\d{1,2})?
I guess you wanted this. [] is a character class, not what you think: a character class matches just one character out of all the characters you have defined.
See demo.
https://regex101.com/r/eB8xU8/12
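To make the difference concrete, here is a small sketch (the variable names are only for illustration) contrasting a bracketed class with a grouped pattern:

```ruby
# A character class matches exactly ONE character from the set it lists.
one_char = /[\d+.,?]/              # any single digit, '+', '.', ',' or '?'
"172.5"[one_char]  #=> "1"

# Grouping with (?:...) lets the quantifiers apply to sub-patterns instead.
grouped = /\d+(?:[,.]\d{1,2})?/
"172.5"[grouped]   #=> "172.5"
"173.22"[grouped]  #=> "173.22"
"180 lbs"[grouped] #=> "180"
```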

Related

Positive Lookahead and Non-capturing group difference

When you want to match either of two patterns but not capture it, you would use a noncapturing group ?::
/(?:https?|ftp):\/\/(.+)/
But what if I want to capture '_1' in the string 'john_1'? It could be '_2' or '_' followed by anything else. First I tried a non-capturing group:
'john_1'.gsub(/(?:.+)(_.+)/, "")
=> ""
It does not work. I am telling it to not capture one or more characters but to capture _ and all characters after it.
Instead the following works:
'john_1'.gsub(/(?=.+)(_.+)/, "")
=> "john"
I used a positive lookahead. The definition I found for positive lookahead was as follows:
q(?=u) matches a q that is
followed by a u, without making the u part of the match. The positive
lookahead construct is a pair of parentheses, with the opening
parenthesis followed by a question mark and an equals sign.
But that definition doesn't really fit my example. What makes the Positive Lookahead work but not the Non-capturing group work in the example I provide?
Capturing and matching are two different things. (?:expr) doesn't capture expr, but it's still included in the matched string. Zero-width assertions, e.g. (?=expr), don't capture or include expr in the matched string.
Perhaps some examples will help illustrate the difference:
> "abcdef"[/abc(def)/] # => abcdef
> $1 # => def
> "abcdef"[/abc(?:def)/] # => abcdef
> $1 # => nil
> "abcdef"[/abc(?=def)/] # => abc
> $1 # => nil
When you use a non-capturing group in your String#gsub call, it's still part of the match, and gets replaced by the replacement string.
Your first example doesn't work because a non-capturing group is still part of the overall match, whereas the lookahead is only used for matching but isn't part of the overall match.
This is easier to understand if you get the actual match data:
# Non-capturing group
/(?:.+)(_.+)/.match 'john_1'
=> #<MatchData "john_1" 1:"_1">
# Positive Lookahead
/(?=.+)(_.+)/.match 'john_1'
=> #<MatchData "_1" 1:"_1">
EDIT: I should also mention that sub and gsub work on the entire match, not individual capture groups (although those can be used in the replacement).
'john_1'.gsub(/(?:.+)(_.+)/, 'phil\1')
=> "phil_1"
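It may also help to see that the lookahead in the question does very little work. In the sketch below (variable names are only for illustration), (?=.+) merely asserts that at least one character follows the current position, so the pattern behaves like plain _.+ for this input:

```ruby
s = 'john_1'

with_lookahead    = /(?=.+)(_.+)/
without_lookahead = /_.+/

# Both patterns match only "_1", so both substitutions leave "john".
s[with_lookahead]             #=> "_1"
s[without_lookahead]          #=> "_1"
s.gsub(with_lookahead, "")    #=> "john"
s.gsub(without_lookahead, "") #=> "john"

# The non-capturing version matches (and therefore replaces) everything.
s[/(?:.+)(_.+)/]              #=> "john_1"
```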
Let's consider a couple of situations.
The string preceding the underscore must be "john" and the underscore is followed by one or more characters
str = "john_1"
You have two choices.
Use a positive lookbehind
str[/(?<=john)_.+/]
#=> "_1"
The positive lookbehind requires that "john" must appear immediately before the underscore, but it is not part of the match that is returned.
Use a capture group:
str[/john(_.+)/, 1]
#=> "_1"
This regular expression matches "john_1", but "_.+" is captured in capture group 1. By examining the doc for the method String#[] you will see that one form of the method is str[regexp, capture], which returns the contents of the capture group capture. Here capture equals 1, meaning the first capture group.
Note that the string following the underscore may contain underscores: "john_1_a"[/(?<=john)_.+/] #=> "_1_a".
If the underscore can be at the end of the string replace + with * in the above regular expressions (meaning match zero or more characters after the underscore).
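A quick sketch of the difference between + and * when nothing follows the underscore:

```ruby
"john_"[/(?<=john)_.+/]   #=> nil  (nothing follows the underscore)
"john_"[/(?<=john)_.*/]   #=> "_"
"john_1"[/(?<=john)_.*/]  #=> "_1"
"john_"[/john(_.*)/, 1]   #=> "_"
```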
The string preceding the underscore can be anything and the underscore is followed by one or more characters
str = "john_mary_tom_julie"
We may consider two cases.
The string returned is to begin with the first underscore
In this case we could write:
str[/_.+/]
#=> "_mary_tom_julie"
This works because the regex engine returns the leftmost match, so matching begins at the first underscore encountered; .+ is greedy by default, so it then consumes the rest of the string.
The string returned is to begin with the last underscore
Here we could write:
str[/_[^_]+\z/]
#=> "_julie"
This regex matches an underscore followed by one or more characters that are not underscores, followed by the end-of-string anchor (\z).
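The two cases above can be compared side by side. The following is only a sketch: the lazy variant and the String#rindex alternative are additions for illustration, not part of the approach described above.

```ruby
str = "john_mary_tom_julie"

# The engine reports the leftmost match, so matching starts at the first "_".
str[/_.+/]   #=> "_mary_tom_julie"   (greedy .+ runs to the end)
str[/_.+?/]  #=> "_m"                (lazy .+? stops as early as possible)

# Starting from the last underscore without a regex, via String#rindex.
str[str.rindex('_')..-1] #=> "_julie"
```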
Aside: the method String#[]
[] may seem an odd name for a method but it is a method nevertheless, so it can be invoked in the conventional way:
str.[](/john(_.+)/, 1)
#=> "_1"
The expression str[/john(_.+)/, 1] is an example (of which there are many in Ruby) of syntactic sugar. When written str[...] Ruby converts it to the conventional expression for methods before evaluating it.

Ruby regular expression non capture group

I'm trying to grab id number from the string, say
id/number/2000GXZ2/ref=sr
using
(?:id\/number\/)([a-zA-Z0-9]{8})
For some reason the non-capture group is not working, giving me:
id/number/2000GXZ2
As mentioned by others, non-capturing groups still count towards the overall match. If you don't want that part in your match use a lookbehind.
Rubular example
(?<=id\/number\/)([a-zA-Z0-9]{8})
(?<=pat) - Positive lookbehind assertion: ensures that the preceding characters match pat, but doesn't include those characters in the matched text
Ruby Doc Regexp
Also, the capture group around the id number is unnecessary in this case.
You have:
str = "id/number/2000GXZ2/ref=sr"
r = /
(?:id\/number\/) # match string in a non-capture group
([a-zA-Z0-9]{8}) # match character in character class 8 times, in capture group 1
/x # extended/free-spacing regex definition mode
Then (using String#[]):
str[r]
#=> "id/number/2000GXZ2"
returns the entire match, as it should, not just the contents of capture group 1. There are a few ways to remedy this. Consider first ones that do not use a capture group.
@jacob.m suggested putting the first part in a positive lookbehind (modified slightly from his code):
r = /
(?<=id\/number\/) # match string in positive lookbehind
[[:alnum:]]{8} # match 8 alphanumeric characters
/x
str[r]
#=> "2000GXZ2"
An alternative is:
r = /
id\/number\/ # match string
\K # forget everything matched so far
[[:alnum:]]{8} # match 8 alphanumeric characters
/x
str[r]
#=> "2000GXZ2"
\K is especially useful when the match to forget is variable-length, as (in Ruby) positive lookbehinds do not work with variable-length matches.
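As a sketch of the variable-length case, suppose (hypothetically) that the middle path segment could be any run of word characters rather than the literal "number". \K copes with that directly. The second half builds the equivalent lookbehind from a string so that a rejected pattern raises RegexpError at run time instead of failing when the file is parsed. Ruby's engine has historically rejected unbounded quantifiers inside lookbehind, so that flag would typically come out false.

```ruby
str = "id/number/2000GXZ2/ref=sr"

# \K after a variable-length prefix: any word characters between the slashes.
str[/id\/\w+\/\K[[:alnum:]]{8}/] #=> "2000GXZ2"

# The same prefix inside a lookbehind, compiled at run time.
lookbehind_accepted =
  begin
    Regexp.new('(?<=id/\w+/)[[:alnum:]]{8}')
    true
  rescue RegexpError
    false
  end
```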
With both of these approaches, if the part to be matched contains only numbers and capital letters, you may want to use [A-Z0-9]+ instead of [[:alnum:]] (though the latter includes Unicode letters, not just those from the English alphabet). In fact, if all the entries have the form of your example, you might be able to use:
r = /
\d # match a digit
[A-Z0-9]{7} # match 7 capital letters or digits
/x
str[r]
#=> "2000GXZ2"
The other line of approach is to keep your capture group. One simple way is:
r = /
id\/number\/ # match string
([[:alnum:]]{8}) # match 8 alphanumeric characters in capture group 1
/x
str =~ r
str[r, 1] #=> "2000GXZ2"
Alternatively, you could use String#sub to replace the entire string with the contents of the capture group:
r = /
id\/number\/ # match string
([[:alnum:]]{8}) # match 8 alphanumeric characters in capture group 1
.* # match the remainder of the string
/x
str.sub(r, '\1') #=> "2000GXZ2"
str.sub(r, "\\1") #=> "2000GXZ2"
str.sub(r) { $1 } #=> "2000GXZ2"
This is an inconsistency in how Ruby's Regexp-related methods report matches. Some return the whole match while others return only the captured groups.
In this case, one method we can use to get the behavior you're looking for is scan.
I don't think anyone here actually mentions how to get your Regexp working as you originally intended, which was to get the capture-only match. To do that, you would use the scan method like so with your original pattern:
test_me.rb
test_string="id/number/2000GXZ2/ref=sr"
result = test_string.scan(/(?:id\/number\/)([a-zA-Z0-9]{8})/)
puts result
2000GXZ2
That said, replacing the non-capture group (?:) with a lookbehind (?<=) will benefit you both when you use scan and in other parts of Ruby that use Regexps.
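For comparison, here is a short sketch of what several common methods return for the original pattern. Note that scan returns a nested array when the pattern contains capture groups.

```ruby
str = "id/number/2000GXZ2/ref=sr"
r   = /(?:id\/number\/)([a-zA-Z0-9]{8})/

str[r]                    #=> "id/number/2000GXZ2"  (whole match)
str[r, 1]                 #=> "2000GXZ2"            (capture group 1)
str.match(r)[0]           #=> "id/number/2000GXZ2"
str.match(r)[1]           #=> "2000GXZ2"
str.scan(r)               #=> [["2000GXZ2"]]        (captures only, nested)
str.scan(r).flatten.first #=> "2000GXZ2"
```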

Ruby search a string for matching character pairs

I want to match character pairs in a string. Let's say the string is:
"zttabcgqztwdegqf". Both "zt" and "gq" are matching pairs of characters in the string.
The following code finds the "zt" matching pair, but not the "gq" pair:
#!/usr/bin/env ruby
string = "zttabcgqztwdegqf"
puts string.scan(/.{1,2}/).detect{ |c| string.count(c) > 1 }
The code provides matching pairs where the indices of the pairs are 0&1,2&3,4&5... but not 1&2,3&4,5&6, etc:
zt
ta
bc
gq
zt
wd
eg
qf
I'm not sure regex in Ruby is the best way to go. But I want to use Ruby for the solution.
You can do your search with a single regex:
puts string.scan(/(?=(.{2}).*\1)/)
regex101 demo
Output
zt
gq
Regex Breakout
(?= # Start a lookahead
(.{2}) # Search any couple of char and group it in \1
.*\1 # Search ahead in the string for another \1 to validate
) # Close lookahead
Note
Putting all the checks inside a lookahead ensures the regex engine does not consume the couple while validating it.
So it also works with overlapping couples like in the string abcabc: the output will correctly be ab,bc.
Oddity
If the regex engine does not consume the characters, how can it reach the end of the string?
Internally, after the check, Onigmo (the Ruby regex engine) automatically advances one step. Most regex flavours behave this way, but e.g. the JavaScript engine needs the programmer to increment the last match index manually.
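The automatic advance can be seen with a pattern that consists of nothing but a lookahead. A small sketch:

```ruby
# Each match is empty (the lookahead consumes nothing), yet scan still
# advances one position per match and collects the capture at each step.
"abc".scan(/(?=(.))/)                   #=> [["a"], ["b"], ["c"]]

# Overlapping repeated couples are therefore found as well.
"abcabc".scan(/(?=(.{2}).*\1)/).flatten #=> ["ab", "bc"]
```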
str = "ztcabcgqzttwtcdegqf"
r = /
(.) # match any character in capture group 1
(?= # begin a positive lookahead
(.) # match any character in capture group 2
.+ # match >= 1 characters
\1 # match capture group 1
\2 # match capture group 2
) # close positive lookahead
/x # extended/free-spacing regex definition mode
str.scan(r).map(&:join)
#=> ["zt", "tc", "gq"]
Here is one way to do this without using regex:
string = "zttabcgqztwdegqf"
p string.split('').each_cons(2).map(&:join).select {|i| string.scan(i).size > 1 }.uniq
#=> ["zt", "gq"]
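Another regex-free sketch, assuming Ruby 2.7 or later for Enumerable#tally (the variable names are only for illustration). It counts each adjacent pair once per position instead of rescanning the string for every pair:

```ruby
string = "zttabcgqztwdegqf"

pairs    = string.each_char.each_cons(2).map(&:join)  # every adjacent pair
repeated = pairs.tally.select { |_pair, count| count > 1 }.keys
#=> ["zt", "gq"]
```

One difference to be aware of: tally counts overlapping occurrences (in "aaa" the pair "aa" is counted twice), whereas String#scan with a string argument counts non-overlapping ones.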

Why is this negative look behind wrong?

def get_hashtags(post)
tags = []
post.scan(/(?<![0-9a-zA-Z])(#+)([a-zA-Z]+)/){|x,y| tags << y}
tags
end
Test.assert_equals(get_hashtags("two hashs##in middle of word#"), [])
#Expected: [], instead got: ["in"]
Should it not look behind to see if the match doesn't begin with a word or number? Why is it still accepting 'in' as a valid match?
You should use \K rather than a negative lookbehind. That allows you to simplify your regex considerably: no need for a pre-defined array, capture groups or a block.
\K means "discard everything matched so far". The key here is that variable-length matches can precede \K, whereas (in Ruby and most other languages) variable-length matches are not permitted in (negative or positive) lookbehinds.
r = /
[^0-9a-zA-Z#] # match any character not in the character class
\#+ # match one or more pound signs
\K # discard everything matched so far
[a-zA-Z]+ # match one or more letters
/x # extended mode
Note that the # in \#+ would not need to be escaped if I weren't writing the regex in extended mode.
"two hashs##in middle of word#".scan r
#=> []
"two hashs&#in middle of word#".scan r
#=> ["in"]
"two hashs#in middle of word&#abc of another word.###def ".scan r
#=> ["abc", "def"]
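As for why the pattern in the question still returns "in": the lookbehind is only checked at the position where a match starts, and the match is free to start at the second #, whose preceding character (#) is not alphanumeric. The sketch below shows this, together with a lookbehind-based variant that also excludes # in the lookbehind. That variant is an alternative for illustration, not the \K approach given above.

```ruby
original = /(?<![0-9a-zA-Z])(#+)([a-zA-Z]+)/
"two hashs##in middle of word#".scan(original) #=> [["#", "in"]]

# Adding '#' to the lookbehind class stops a match from starting mid-run.
stricter = /(?<![0-9a-zA-Z#])#+([a-zA-Z]+)/
"two hashs##in middle of word#".scan(stricter).flatten #=> []
"two hashs&#in middle of word#".scan(stricter).flatten #=> ["in"]
"#ruby at the start".scan(stricter).flatten            #=> ["ruby"]
```

Unlike the \K pattern, which needs a character before the pound signs, this variant also accepts a hashtag at the very start of the string.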

Use regular expression to fetch 3 groups from string

This is my expected result.
Input a string and get three strings returned.
I have no idea how to do this with a regex in Ruby.
This is my rough idea:
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2241, OOP, 2003
If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]
Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
In Ruby, ^ matches at the start of a line and $ at the end of a line (use \A and \z to anchor to the whole string). \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
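One caveat worth illustrating: in Ruby, ^ and $ anchor to line boundaries, while \A and \z anchor to the whole string. The difference only shows up with multi-line input, as in this sketch (the sample input is made up for the demonstration):

```ruby
multi = "junk\nR224_OO2003"

line_anchored   = /^(R\d+)_(\S+)(\d{4})$/
string_anchored = /\A(R\d+)_(\S+)(\d{4})\z/

line_anchored.match?(multi)           #=> true  (the second line matches)
string_anchored.match?(multi)         #=> false (the whole string must match)
string_anchored.match?("R224_OO2003") #=> true
```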
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
To use the named captures, include ?<name> right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.
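A few other ways to read the groups back, sketched with the same input. The second pattern's group names (prefix, code, year) are made up for the illustration. MatchData#named_captures needs Ruby 2.4 or later, and the local-variable form only works when a regex literal is on the left-hand side of =~.

```ruby
x = "R2241_OOP2003"
m = /\A(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})\z/.match(x)

m.captures        # all groups in order: ["R2241", "OOP", "2003"]
m.named_captures  # a Hash keyed by group name
m[:second]        #=> "OOP"

# With a regex literal on the left of =~, named groups become local variables.
if /\A(?<prefix>R\d+)_(?<code>\S+)(?<year>\d{4})\z/ =~ x
  summary = "#{prefix} #{code} #{year}"  # "R2241 OOP 2003"
end
```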
As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "OO"
puts my_match[4] #=> "2003"
A MatchData object holds the capture groups starting at index [1]. As you can see, index [0] returns the entire match. If you don't want to capture the "_" you can leave its parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
This is a lazy (non-greedy) quantifier: it matches zero or more of any character, as few as possible, so the group is allowed to be empty.
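A short sketch of how the lazy and greedy forms split the sample input differently:

```ruby
s = "R224_OO2003"

s.match(/(.*?)_(.*?)(\d+)/).captures #=> ["R224", "OO", "2003"]
s.match(/(.*)_(.*)(\d+)/).captures   #=> ["R224", "OO200", "3"]

# Because *? allows zero characters, groups may come back empty.
"_2003".match(/(.*?)_(.*?)(\d+)/).captures #=> ["", "", "2003"]
```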

Resources