Match exact word in string ruby regex - ruby

I have this string in ruby and I'm trying to make a match for sin
"please don't share this: 234-604-142"
"please don't share this: 234-604-1421"
so far I have
\d{3}-\d{3}-\d{3}
but this will also match the second case. Adding a ^ and $ for begins and ends will cause this not to function at all.
To fix this I could do this:
x = "please don't share this: 234-604-1421"
x.split.last ~= /^\d{3}-\d{3}-\d{3}$/
but is there a way to match the sin otherwise?

Another option worth considering for your rule is
(?<!\d)\d{3}-\d{3}-\d{3}(?!\d)
That way, if for any reason your phone number is followed or preceded by letters, as in
"please don't share this number234-604-142because it's private"
the regex will still work.
Explain Regex
(?<! # look behind to see if there is not:
\d # digits (0-9)
) # end of look-behind
\d{3} # digits (0-9) (3 times)
- # '-'
\d{3} # digits (0-9) (3 times)
- # '-'
\d{3} # digits (0-9) (3 times)
(?! # look ahead to see if there is not:
\d # digits (0-9)
) # end of look-ahead

Related

Ruby regex get the word combo separated by period

I'm trying to use Ruby regex to get word combo like below.
In a example below I only need cases 1-4, * marked them in caps for easy testing. Word in the middle (dbo, bcd) could be anything or nothing like in case#3. I have trouble how to get that double period case#3 working. It's also good to get standalone SALES as word too but probably it's too much for one regex ?Tx all guru .
This is my script which partially working, need add alpha..SALES
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX'
regex = /alpha+\.+[a-z]+\.?sales/ix
puts 'R: ' + s.scan(regex).to_s
##R: ["alpha.dbo.SALES", "alpha.bcd.SALES"]
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX 7x alpha.abc.SALES.etc'
regex = /(?<=^|\s)(?:alpha\.[a-z]*\.)?(?:sales)(?=\s|$)/i
puts 'R: ' + s.scan(regex).to_s
Output:
R: ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
r = /
(?<=\d[ ]) # match a digit followed by a space in a positive lookbehind
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
| # or
\. # match a period
) # end non-capture group
)? # end non-capture group and optionally match it
SALES # match string
(?!=[.\p{Alpha}]) # do not match a period or letter (negative lookahead)
/x # free-spacing regex definition mode.
s.scan(r)
#=> ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
This regular expression is customarily written as follows.
r = /
(?<=\d )(?:\p{Alpha}+\.(?:\p{Alpha}+\.|\.))?SALES(?!=[.\p{Alpha}])/
In free-spacing mode the space must be put in a character class ([ ]); else it would be stripped out.

Improve my Regex to include numbers that contain decimals and percentage signs

I have the following regex which will capture the first N words and finish at the next period, exclamation point or question mark. I need to get chunks of texts that vary in the number of words but I want complete sentences.
regex = (?:\w+[.?!]?\s+){10}(?:\w+,?\s+)*?\w+[.?!]
It works with the following text:
Therapy extract straw and chitosan from shrimp shells alone
accounted for 2, 4, 6, 8 and 10% found that the extract straw 8% is
highly effective in inhibiting the growth of algae Microcystis spp.
The number of cells and the amount of chlorophyll a was reduced during
treatment. Both value decreased continuous until the end of the trial.
https://regex101.com/r/ardIQ7/5
However it won't work with the following text:
Therapy extract straw and chitosan from shrimp shells alone accounted
for 2, 4, 6, 8 and 10% found that the extract straw 8.2% is highly
effective in inhibiting the growth of algae Microcystis spp. The
number of cells and the amount of chlorophyll a was reduced during
treatment. Both value decreased continuous until the end of the trial.
That is because of the digits (8.2%) with decimals and %.
I have been trying to figure out how to also capture these items but need some assistance to point me in the right direction. I don't just want to capture the first sentence. I want to capture N words which may include several sentences and returns complete sentences.
r = /
(?: # begin a non-capture group
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
| # or
\-? # optionally match a minus sign
(?: # begin non-capture group
\d+ # match one or more digits
| # or
\d+ # match one or more digits
\. # match a decimal point
\d+ # match one or more digits
) # end non-capture group
%? # optionally match a percentage character
) # end non-capture group
[,;:.!?]? # optionally ('?' following ']') match a punctuation char
[ ]+ # match one or more spaces
) # end non-capture group
{9,}? # execute the preceding non-capture group at least 14 times, lazily ('?')
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
| # or
\-? # optionally match a minus sign
(?: # begin non-capture group
\d+ # match one or more digits
| # or
\d+ # match one or more digits
\. # match a decimal point
\d+ # match one or more digits
) # end non-capture group
%? # optionally match a percentage character
) # end non-capture group
[.!?] # match one of the three punctuation characters
(?!\S) # negative look-ahead: do not match a non-whitespace char
/x # free-spacing regex definition mode
Let text equal the paragraph you wish to examine ("Therapy extract straw...end of the trial.")
Then
text[r]
#=> "Therapy extract straw and chitosan from...the growth of algae Microcystis spp."
We can simplify the construction of the regex (and avoid duplicate bits) as follows.
def construct_regex(min_nbr_words)
common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
/(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end
r = construct_regex(10)
#=> /(?:(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[,;:.!?]? +){10,}?(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[.!?](?!\S)/
This regex could be simplified if it were permitted to match nonsense words such as "ab2.3e%" or "2.3.2%". As presently defined, the regex will not match such words.
Try this, (?:\S+[,.?!]?\s+){1,200}[\s\S]*?(\. |!|\?)
This will match the N number of characters.
If the Nth character didn't end a sentence, then it will match until the previous sentence. The N should be mentioned as {1, N}
Regex

Ruby parsing and regex

Picked up Ruby recently and have been fiddling around with it. I wanted to learn how to use regex or other Ruby tricks to check for certain words, whitespace characters, valid format etc in a given text line.
Let's say I have an order list that looks strictly like this in this format:
cost: 50 items: book,lamp
One space after semicolon, no space after each comma, no trailing whitespaces at the end and stuff like that.
How can I check for errors in this format using Ruby? This for example should fail my checks:
cost: 60 items:shoes,football
My goal was to split the string by a " " and check to see if the first word was "cost:", if the second word was a number and so on but I realized that splitting on a " " doesn't help me check for extra whitespaces as it just eats it up. Also doesn't help me check for trailing whitespaces. How do I go about doing this?
You could use the following regular expression.
r = /
\A # match beginning of string
cost:\s # match "cost:" followed by a space
\d+\s # match > 0 digits followed by a space
items:\s # match "items:" followed by a space
[[:alpha:]]+ # match > 0 lowercase or uppercase letters
(?:,[[:alpha:]]+) # match a comma followed by > 0 lowercase or uppercase
# letters in a non-capture group (?: ... )
* # perform the match on non-capture group >= 0 times
\z # match the end of the string
/x # free-spacing regex definition mode
"cost: 50 items: book,lamp" =~ r #=> 0 (a match, beginning at index 0)
"cost: 50 items: book,lamp,table" =~ r #=> 0 (a match, beginning at index 0)
"cost: 60 items:shoes,football" =~ r #=> nil (no match)
The regex can can of course be written in the normal manner:
r = /\Acost:\s\d+\sitems:\s[[:alpha:]]+(?:,[[:alpha:]]+)*\z/
or
r = /\Acost: \d+ items: [[:alpha:]]+(?:,[[:alpha:]]+)*\z/
though a whitespace character (\s) cannot be replaced by a space in the free-spacing mode definition (\x).

Need a regex pattern to match date with optional time

I need a regex pattern which matches a date with optional time.
The date should be a valid U.S. date in m/d/yyyy format. The time should be h:mm:ss am/pm or 24-hour time hh:mm:ss.
Matches: 9/1/2011 | 9/1/2011 10:00 am | 9/1/2011 10:00 AM | 9/1/2011 10:00:00
This pattern will be used in a Ruby on Rails project, so it should be in a format usable via Ruby. See http://rubular.com/ for testing.
Here's my existing date pattern (which may be an over-kill):
DATE_PATTERN = /^((((0[13578])|([13578])|(1[02]))[\/](([1-9])|([0-2][0-9])|(3[01])))|(((0[469])|([469])|(11))[\/](([1-9])|([0-2][0-9])|(30)))|((2|02)[\/](([1-9])|([0-2][0-9]))))[\/]\d{4}$|^\d{4}/
Regular expressions are horrible for this kind of job. If you're using Ruby I'd recommend using DateTime.strptime to parse the data and check its validity:
def validate_date(date_str)
valid_formats = ["%m/%d/%Y", "%m/%d/%Y %I:%M %P"]
#see http://www.ruby-doc.org/core-1.9.3/Time.html#method-i-strftime for more
valid_formats.each do |format|
valid = Time.strptime(date_str, format) rescue false
return true if valid
end
return false
end
Well, here's what I ended up with; using stricter military time:
DATE_TIME_FORMAT = /^([0,1]?\d{1})\/([0-2]?\d{1}|[3][0,1]{1})\/([1]{1}[9]{1}[9]{1}\d{1}|[2-9]{1}\d{3})\s([0]?\d|1\d|2[0-3]):([0-5]\d):([0-5]\d)$/
Matches: 1/19/2011 23:59:59
Captures:
1
19
2011
23
59
59
if subject =~ /\A(?:0?[1-9]|1[012])\/(?:0?[1-9]|[12]\d|3[01])\/(?:\d{4})(?:\s+(?:(?:[01]?\d|2[0-3]):(?:[0-5]\d)|(?:0?\d|1[0-2]):(?:[0-5]\d)\s+[ap]m))?\s*\Z/i
# Successful match
Good luck..
How it works :
"
^ # Assert position at the beginning of the string
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
[1-9] # Match a single character in the range between “1” and “9”
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
1 # Match the character “1” literally
[012] # Match a single character present in the list “012”
)
/ # Match the character “/” literally
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
[1-9] # Match a single character in the range between “1” and “9”
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
[12] # Match a single character present in the list “12”
\d # Match a single digit 0..9
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
3 # Match the character “3” literally
[01] # Match a single character present in the list “01”
)
/ # Match the character “/” literally
(?: # Match the regular expression below
\d # Match a single digit 0..9
{4} # Exactly 4 times
)
(?: # Match the regular expression below
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
[01] # Match a single character present in the list “01”
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
2 # Match the character “2” literally
[0-3] # Match a single character in the range between “0” and “3”
)
: # Match the character “:” literally
(?: # Match the regular expression below
[0-5] # Match a single character in the range between “0” and “5”
\d # Match a single digit 0..9
)
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
1 # Match the character “1” literally
[0-2] # Match a single character in the range between “0” and “2”
)
: # Match the character “:” literally
(?: # Match the regular expression below
[0-5] # Match a single character in the range between “0” and “5”
\d # Match a single digit 0..9
)
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
[ap] # Match a single character present in the list “ap”
m # Match the character “m” literally
)
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
"
Remember is not the spoon that bents but you!
Here's what I came up with that seems to work:
regex = /^1?\d{1}\/[123]?\d{1}\/\d{4}(\s[12]?\d:[0-5]\d(:[0-5]\d)?(\s[ap]m)?)?$/

Ruby regex extracting words

I'm currently struggling to come up with a regex that can split up a string into words where words are defined as a sequence of characters surrounded by whitespace, or enclosed between double quotes. I'm using String#scan
For instance, the string:
' hello "my name" is "Tom"'
should match the words:
hello
my name
is
Tom
I managed to match the words enclosed in double quotes by using:
/"([^\"]*)"/
but I can't figure out how to incorporate the surrounded by whitespace characters to get 'hello', 'is', and 'Tom' while at the same time not screw up 'my name'.
Any help with this would be appreciated!
result = ' hello "my name" is "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
will work for you. It will print
=> ["", "hello", "\"my name\"", "is", "\"Tom\""]
Just ignore the empty strings.
Explanation
"
\\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\$ # Assert position at the end of a line (at the end of the string or before a line break character)
)
"
You can use reject like this to avoid empty strings
result = ' hello "my name" is "Tom"'
.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}
prints
=> ["hello", "\"my name\"", "is", "\"Tom\""]
text = ' hello "my name" is "Tom"'
text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}
Produces:
hello
my name
is
Tom
Explanation:
0 or more spaces followed by
either
some words within double-quotes OR
a single word
followed by 0 or more spaces
You can try this regex:
/\b(\w+)\b/
which uses \b to find the word boundary. And this web site http://rubular.com/ is helpful.

Resources