regular expression in ruby for strings with multiple patterns - ruby

I have strings with optional substrings, and I am looking for a regular expression with named captures that handles them, ideally a single regular expression for all cases, in Ruby.
Sample strings:
string1 = bike wash #a simple task
string2 = bike wash # bike point # a simple task with location
string3 = bike wash # bike point on 13 may 11 # task with location and date
string4 = bike wash # bike point on 13 may 11 # 10 AM # task with location, date and time
string5 = bike wash on 13 may 11 # 10 AM # task with date and time without location
string6 = bike wash on 13 may 11 # task and date
I have spent almost a day on Google and Stack Overflow trying to find a single regular expression that covers all of the string patterns above.

Assumptions:
Location and time start with #, and # appears nowhere else.
Date starts with on, surrounded by obligatory whitespace, and on appears nowhere else.
Task is obligatory.
Location and date are optional and independent of one another.
Time appears only when there is a date.
Task, location, date and time only appear in this order.
Also, it should be taken for granted that the regex engine is Oniguruma, since named captures are mentioned.
# In free-spacing (x) mode an unescaped # starts a comment, so it is written \# here.
regex = /
  (?<task>.*?)
  (?:\s*\#\s*(?<location>.*?))?
  (?:\s+on\s+(?<date>.*?)
    (?:\s*\#\s*(?<time>.*))?
  )?
\z/x
string4.match(regex)
# => #<MatchData
"bike wash # bike point on 13 may 11 # 10 AM"
task: "bike wash"
location: "bike point"
date: "13 may 11"
time: "10 AM"
>
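To check the assumptions listed above against every sample string, a minimal sketch along these lines could be used. The constant name TASK_RE and the samples array are only illustrative, and the # is escaped because the pattern is written in free-spacing mode.

```ruby
# Named-capture pattern in free-spacing mode; \# is a literal hash sign.
TASK_RE = /
  (?<task>.*?)
  (?:\s*\#\s*(?<location>.*?))?
  (?:\s+on\s+(?<date>.*?)
    (?:\s*\#\s*(?<time>.*))?
  )?
\z/x

# The six sample strings from the question, without the trailing remarks.
samples = [
  'bike wash',
  'bike wash # bike point',
  'bike wash # bike point on 13 may 11',
  'bike wash # bike point on 13 may 11 # 10 AM',
  'bike wash on 13 may 11 # 10 AM',
  'bike wash on 13 may 11'
]

# named_captures returns a hash with string keys; unmatched groups are nil.
results = samples.map { |s| TASK_RE.match(s).named_captures }
results.each { |r| p r }
```

Groups that do not take part in a match come back as nil, so the caller can tell a missing location from an empty one.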

For a regular expression to do this job, some assumptions need to be made: for example, the task must not contain " # " or " on ", and there may be others.
To match any character except the space that begins " # " or " on ", I'd put the negative lookahead (?! # | on ) in front of it.
So you could find the task using ((?:(?! # | on ).)+). This is followed by an optional location, prefixed with " # ": (?: # ((?:(?! on ).)+))?. Note that the location should not include " on " here.
Following that, there is an optional date with an optional time: (?: on ((?:(?! # ).)+)(?: # (.+))?)?. All together:
((?:(?! # | on ).)+)(?: # ((?:(?! on ).)+))?(?: on ((?:(?! # ).)+)(?: # (.+))?)?
This will have task, location, date and time in the first four capturing groups. See here: http://regexr.com?2tnb3
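In Ruby, the same pattern can be used with positional groups. The following is only a sketch: LOOKAHEAD_RE and parse_task are names invented for illustration, and the pattern is left unanchored as in the answer.

```ruby
# The lookahead-based pattern; groups 1-4 are task, location, date and time.
LOOKAHEAD_RE = /((?:(?! # | on ).)+)(?: # ((?:(?! on ).)+))?(?: on ((?:(?! # ).)+)(?: # (.+))?)?/

# Hypothetical helper: returns a hash of the four parts, or nil if nothing matches.
def parse_task(str)
  m = LOOKAHEAD_RE.match(str)
  return nil unless m

  task, location, date, time = m.captures
  { task: task, location: location, date: date, time: time }
end

p parse_task('bike wash # bike point on 13 may 11 # 10 AM')
p parse_task('bike wash on 13 may 11')
```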

Related

How to select first 280 words of text up to the closest period?

I need to extract a shorter segment of text of a specified number of words from a longer one. I can do this using
text = "There was a very big cat that was sitting on the ledge. It was overlooking the garden. The dog next door watched with curiosity."
text.split[0..14].join(' ')
>> "There was a very big cat that was sitting on the ledge. It was overlooking"
I would like to select the text up to the next period so I don't end up with a partial sentence.
Is there a way, possibly using a regex, to get the text up to and including the first period after the 15th word?
You can use
(?:\w+[,.?!]?\s+){14}(?:\w+,?\s+)*?\w+[.?!]
This repeats a word, an optional comma, period, question mark or exclamation mark, and whitespace, 14 times. It then lazily repeats a word with an optional comma and whitespace, and finally matches a word followed by a period, question mark or exclamation mark, so the match ends at the first sentence terminator at or after the 15th word.
https://regex101.com/r/ardIQ7/4
r = /
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
[.!?]? # optionally ('?' following ']') match one of the 3 punctuation chars
[ ]+ # match one or more spaces
) # end non-capture group
{14,}? # execute the preceding non-capture group at least 14 times, lazily ('?')
\p{Alpha}+ # match one or more letters
[.!?] # match one of the three punctuation characters
/x # free-spacing regex definition mode
text[r]
#=> "There was a very big cat that was sitting on the ledge. It was overlooking the garden."
Free-spacing mode strips out spaces, which is why the space character above is in a character class ([ ]+). Written conventionally, the regex is as follows.
/(?:\p{Alpha}+[.!?]? +){14,}?\p{Alpha}+[.!?]/
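If the word count needs to vary, the quantifier can be built by interpolation. This is a sketch only; the method name first_words_to_sentence_end is hypothetical, and it keeps the answer's assumptions that words consist of letters and are separated by plain spaces.

```ruby
# Returns the shortest run of whole sentences from the start of the text that
# contains at least n words, or nil if no such run exists.
def first_words_to_sentence_end(text, n)
  # (n - 1) "word + spaces" groups at minimum, then one final word with a terminator.
  text[/(?:\p{Alpha}+[.!?]? +){#{n - 1},}?\p{Alpha}+[.!?]/]
end

text = 'There was a very big cat that was sitting on the ledge. ' \
       'It was overlooking the garden. The dog next door watched with curiosity.'

puts first_words_to_sentence_end(text, 15)
puts first_words_to_sentence_end(text, 5)
```

With n = 15 this is the same regex as above; a smaller n simply stops at an earlier sentence end.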
You can do something along these lines:
text = "There was a very big cat that was sitting on the ledge. It was overlooking the garden. The dog next door watched with curiosity."
tgt = 15
old_text = text.scan(/[^.]+\.\s?/)
new_text = []
# Add whole sentences until the word count exceeds the target or none remain.
while !old_text.empty? && new_text.join.scan(/\b\p{Alpha}+\b/).length <= tgt
  new_text << old_text.shift
end
p new_text.join
Prints:
"There was a very big cat that was sitting on the ledge. It was overlooking the garden. "
That works with normal sentences of any length and stops as soon as adding one more sentence takes the word count past the target.

Ruby regex get the word combo separated by period

I'm trying to use a Ruby regex to match word combinations like the ones below.
In the example below I only need cases 1-4; I put them in caps for easy testing. The word in the middle (dbo, bcd) could be anything, or nothing as in case 3. I am having trouble getting the double-period case 3 to work. It would also be good to match the standalone word SALES, but perhaps that is too much for one regex? Thanks, all.
This is my script, which partially works; it still needs to match alpha..SALES.
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX'
regex = /alpha+\.+[a-z]+\.?sales/ix
puts 'R: ' + s.scan(regex).to_s
##R: ["alpha.dbo.SALES", "alpha.bcd.SALES"]
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX 7x alpha.abc.SALES.etc'
regex = /(?<=^|\s)(?:alpha\.[a-z]*\.)?(?:sales)(?=\s|$)/i
puts 'R: ' + s.scan(regex).to_s
Output:
R: ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
r = /
(?<=\d[ ]) # match a digit followed by a space in a positive lookbehind
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
| # or
\. # match a period
) # end non-capture group
)? # end non-capture group and optionally match it
SALES # match string
(?![.\p{Alpha}]) # do not match a period or letter (negative lookahead)
/x # free-spacing regex definition mode.
s.scan(r)
#=> ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
This regular expression is customarily written as follows.
r = /(?<=\d )(?:\p{Alpha}+\.(?:\p{Alpha}+\.|\.))?SALES(?![.\p{Alpha}])/
In free-spacing mode the space must be put in a character class ([ ]); else it would be stripped out.
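The effect of the trailing negative lookahead, written (?!...), can be seen with a string that also contains a longer dotted name. A small sketch, with a test string made up for illustration:

```ruby
# Conventional (single-line) form; the final (?!...) rejects SALES when it is
# followed by a period or a letter.
r = /(?<=\d )(?:\p{Alpha}+\.(?:\p{Alpha}+\.|\.))?SALES(?![.\p{Alpha}])/

# Illustrative input: cases 4 and 5 should be rejected.
s = '1 alpha.dbo.SALES 2 alpha..SALES 3 SALES 4 alpha.abc.SALES.etc 5 SALESXX'

matches = s.scan(r)
p matches
#=> ["alpha.dbo.SALES", "alpha..SALES", "SALES"]
```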

How to replace Perl-style regex with MatchData object

I am using the gsub method with a regular expression:
@text.gsub(/(-\n)(\S+)\s/) { "#{$2}\n" }
Example of input data:
"The wolverine is now es-
sentially absent from
the southern end
of its European range."
should return:
"The wolverine is now essentially
absent from
the southern end
of its European range."
The method works fine, but RuboCop reports an offense:
Avoid the use of Perl-style backrefs.
Any ideas how to rewrite it using MatchData object instead of $2?
If you want to use Regexp.last_match:
@text.gsub(/(-\n)(\S+)\s/) { Regexp.last_match[2] + "\n" }
or:
@text.gsub(/-\n(\S+)\s/) { Regexp.last_match[1] + "\n" }
Note that the block in gsub should be used when logic is involved. Without logic, a second parameter set to "\\1\n" or '\1' + "\n" would do just fine.
You can use a backslash reference without the block:
@text.gsub(/(-\n)(\S+)\s/, "\\2\n")
Also, it's a bit cleaner to use only one group, since the first one above isn't needed:
@text.gsub(/-\n(\S+)\s/, "\\1\n")
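As a quick check that these variants agree, the following sketch runs three of them on the sample input from the question. The named group rest is an illustrative addition and not part of the original answers; Regexp.last_match(1) is the method-call equivalent of $1.

```ruby
text = "The wolverine is now es-\nsentially absent from\nthe southern end\nof its European range."

# Block form using the MatchData object instead of $1.
with_block = text.gsub(/-\n(\S+)\s/) { "#{Regexp.last_match(1)}\n" }

# Replacement-string form with a numbered back-reference.
with_backref = text.gsub(/-\n(\S+)\s/, "\\1\n")

# Replacement-string form with a named group (the name "rest" is arbitrary).
with_named = text.gsub(/-\n(?<rest>\S+)\s/, "\\k<rest>\n")

puts with_block
puts with_block == with_backref && with_backref == with_named
```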
This solution accounts for errant spaces before newlines and split words that end a sentence or the string. It uses String#gsub with a block and no capture groups.
Code
R = /
[[:alpha:]]\- # match a letter followed by a hyphen
\s*\n # match a newline possibly preceded by whitespace
[[:alpha:]]+ # match one or more letters
[.?!]? # possibly match a sentence terminator
\n? # possibly match a newline
\s* # match zero or more whitespaces
/x # free-spacing regex definition mode
def remove_hyphens(str)
str.gsub(R) { |s| s.gsub(/[\n\s-]/, '') << "\n" }
end
Examples
str =<<_
The wolverine is now es-
sentially absent from
the south-
ern end of its
European range.
_
puts remove_hyphens(str)
The wolverine is now essentially
absent from
the southern
end of its
European range.
puts remove_hyphens("now es- \nsentially\nabsent")
now essentially
absent
puts remove_hyphens("now es-\nsentially.\nabsent")
now essentially.
absent
remove_hyphens("now es-\nsentially?\n")
#=> "now essentially?\n" (no extra \n at end)

How to extract portion of a line in ruby?

I have a line say
line = "start running at Sat April 1 07:30:37 2017"
and I want to extract
"Sat April 1 07:30:37 2017"
I tried this...
line = "start running at Sat April 1 07:30:37 2017"
if (line =~ /start running at/)
line.split("start running at ").last
end
... but is there any other way of doing this?
This is a way to extract, from an arbitrary string, a substring that represents a time in the given format. I've assumed there is at most one such substring in the string.
require 'time'
R = /
(?:#{Date::ABBR_DAYNAMES.join('|')})\s
# match day name abbreviation in non-capture group. space
(?:#{Date::MONTHNAMES[1,12].join('|')})\s
# match month name in non-capture group, space
\d{1,2}\s # match one or two digits, space
\d{2}: # match two digits, colon
\d{2}: # match two digits, colon
\d{2}\s # match two digits, space
\d{4} # match 4 digits
(?!\d) # do not match digit (negative lookahead)
/x # free-spacing regex def mode
# /
# (?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)\s
# (?:January|February|March|...|November|December)\s
# \d{1,2}\s
# \d{2}:
# \d{2}:
# \d{2}\s
# \d{4}
# (?!\d)
# /x
def extract_time(str)
s = str[R]
return nil if s.nil?
(DateTime.strptime(s, "%a %B %e %H:%M:%S %Y") rescue nil) ? s : nil
end
str = "start eating breakfast at Sat April 1 07:30:37 2017"
extract_time(str)
#=> "Sat April 1 07:30:37 2017"
str = "go back to sleep at Cat April 1 07:30:37 2017"
extract_time(str)
#=> nil
Alternatively, if there is a match against R but DateTime.strptime raises an exception (meaning s is not a valid time for the given time format), one could raise an exception to advise the user.
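A sketch of that alternative follows. The names TIME_RE, TIME_FORMAT and extract_time! are hypothetical, and the error message is only an example; the regex is a condensed copy of R.

```ruby
require 'date'

# Same pattern as R above, condensed.
TIME_RE = /
  (?:#{Date::ABBR_DAYNAMES.join('|')})\s
  (?:#{Date::MONTHNAMES[1, 12].join('|')})\s
  \d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4}(?!\d)
/x
TIME_FORMAT = '%a %B %e %H:%M:%S %Y'

# Returns the matched substring, nil if nothing looks like a time, and raises
# ArgumentError if the substring has the right shape but is not a valid time.
def extract_time!(str)
  s = str[TIME_RE]
  return nil if s.nil?

  DateTime.strptime(s, TIME_FORMAT) # raises a subclass of ArgumentError when invalid
  s
rescue ArgumentError
  raise ArgumentError, "#{s.inspect} looks like a time but is not a valid one"
end

puts extract_time!('start running at Sat April 1 07:30:37 2017')
```

A string such as "Sat April 31 07:30:37 2017" has the right shape but names a day that does not exist, so it is the kind of input this variant is meant to report.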
Try:
line.sub(/start running at (.*)/, '\1')
The standard way to do this with regular expressions would be:
if md = line.match(/start running at (.*)/)
md[1]
end
But you don't need regular expressions, you can do regular string operations:
prefix = 'start running at '
if line.start_with?(prefix)
line[prefix.size..-1]
end
Here's another (as it turns out, slightly faster) option using #partition:
# returns an empty string if there is no match, whereas split.last returns the whole line
line.partition('start running at ').last
I was interested in how this performs against a regexp match, so here's a quick benchmark with 1 million executions each:
line.sub(/start running at (.*)/, '\1')
# => @real=1.7465
line.partition('start running at ').last
# => @real=0.712406
# => this is faster, but you'd need to be calling this quite a bit for it to make a significant difference
Bonus: it also makes it really easy to cater for a more general case e.g. if you have lines that start with "start running at" and others that start with "stop running at". Then something like line.partition(' at ').last will cater for both (and actually run slightly faster).
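A comparison like this can be reproduced with the standard Benchmark module. This is only a sketch of how such a measurement might be set up: the iteration count n is an assumption (smaller than the 1 million mentioned above so that it finishes quickly), and the timings will differ by machine and Ruby version.

```ruby
require 'benchmark'

line = 'start running at Sat April 1 07:30:37 2017'
n = 100_000 # assumed iteration count, chosen only to keep the run short

# Both approaches should extract the same substring.
via_sub       = line.sub(/start running at (.*)/, '\1')
via_partition = line.partition('start running at ').last

Benchmark.bm(10) do |x|
  x.report('sub')       { n.times { line.sub(/start running at (.*)/, '\1') } }
  x.report('partition') { n.times { line.partition('start running at ').last } }
end
```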
And yet another alternative:
puts $1 if line =~ /start running at (.*)/
The shortest would be line["Sat April 1 07:30:37 2017"] which would return your "Sat April 1 07:30:37 2017" string if present and nil if not.
The [] notation on a String is a shorthand for getting a substring out of the string and can be used with another string or a Regular Expression. See https://ruby-doc.org/core-2.2.0/String.html#method-i-5B-5D
If the string is not known in advance, you can also use this shorthand with a regex, as Cary suggested:
line[/start running at (.*)/, 1]
If you want to be sure the extracted date is valid, you would need the regular expression from his answer, but you could still use this method.

Regular expression to match numbers

I am looking for a regular expression to match one of these patterns:
A number followed by x
Numbers separated by one or more spaces
I don't know if match is the correct method to achieve the results.
Matching examples:
' 30x '
'30x'
'20 30'
' 20 30 '
'30x'.match(regex).to_a #=> ['30']
'30 40'.match(regex).to_a #=> ['30', '40']
"30".match(regex).to_a # => ["30"]
" 30 ".match(regex).to_a # => ["30"]
"30 40".match(regex).to_a # => ["30", "40"]
Non-matching examples:
'20x 30 '
'x20 '
"30xx".match(regex).to_a # => nil
"30 a".match(regex).to_a # => nil
"30 60x".match(regex).to_a # => nil
"30x 20".match(regex).to_a # => nil
EDIT
Following @TeroTilus's advice, this is the use case for this question:
The user enters how they will pay a debt, so we've created a text field to
easily enter the payment condition. Example:
> "15 20" # Generate 2 bills: First for 15 days and second for 20 days
> "2x" # Generate 2 bills: First for 30 days and second for 60 days
> "2x 30" # Show message of 'Invalid Format'
> "ANY other string" # Show message of 'Invalid Format'
How about:
/^\s*\d+(?:x\s*|\s*\d+)?$/
explanation:
The regular expression:
(?-imsx:^\s*\d+(?:x\s*|\s*\d+)?$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    x                        'x'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Try string.scan(/(^|\s+)(\d+)x?/).map(&:last), it probably does what you want.
This works for me
\d*x\s*$|\d*[^x] \d[^\s]*
I have never written a single line of Ruby, so the syntax below might be horrible.
But the easiest solution to your problem is to first reduce your first case to your second case, then do your matching for numbers.
Something like:
"20x 30".gsub(/^\s*(\d+)x\s*$/, '\1').match(/\b\d+\b/)
This should work for the examples you gave.
^(?:\s*(\d+))+x?\s*$
^ # Match start of string
(?: # Open non-capturing group
\s* # Zero or more spaces at start or between numbers
(\d+) # Capture one or more numbers
) # Close the group
+ # Group should appear one or more times
x? # The final group may have an x directly after it
\s* # Zero or more trailing spaces are allowed
$ # Match the end of the string
Edited to capture the numbers, not sure if you were looking to do this or just match the string.
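For the use case described in the question's edit, validation and extraction can be combined in one step. The sketch below is one possible reading of that use case and does not reproduce any of the answers above: TERMS_RE, payment_days and the step: parameter are hypothetical names, and the 30-day step is taken from the asker's "2x" example.

```ruby
# Either "<count>x" or one or more whitespace-separated numbers, with optional
# surrounding whitespace; \A and \z anchor to the whole string.
TERMS_RE = /\A\s*(?:(?<count>\d+)x|(?<days>\d+(?:\s+\d+)*))\s*\z/

# Returns an array of day offsets, or nil when the format is invalid.
def payment_days(input, step: 30)
  m = TERMS_RE.match(input)
  return nil unless m

  if m[:count]
    (1..m[:count].to_i).map { |i| i * step } # "2x" -> bills at 30 and 60 days
  else
    m[:days].split.map(&:to_i)               # "15 20" -> bills at 15 and 20 days
  end
end

p payment_days('15 20')
p payment_days('2x')
p payment_days('2x 30')
```

A nil result can then be mapped to the 'Invalid Format' message.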
