Need a regex pattern to match date with optional time - ruby

I need a regex pattern which matches a date with optional time.
The date should be a valid U.S. date in m/d/yyyy format. The time should be h:mm:ss am/pm or 24-hour time hh:mm:ss.
Matches: 9/1/2011 | 9/1/2011 10:00 am | 9/1/2011 10:00 AM | 9/1/2011 10:00:00
This pattern will be used in a Ruby on Rails project, so it should be in a format usable via Ruby. See http://rubular.com/ for testing.
Here's my existing date pattern (which may be overkill):
DATE_PATTERN = /^((((0[13578])|([13578])|(1[02]))[\/](([1-9])|([0-2][0-9])|(3[01])))|(((0[469])|([469])|(11))[\/](([1-9])|([0-2][0-9])|(30)))|((2|02)[\/](([1-9])|([0-2][0-9]))))[\/]\d{4}$|^\d{4}/

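Regular expressions are a poor fit for this kind of job: a pattern can check the shape of the input, but calendar rules such as month lengths and leap years are awkward to encode. If you're using Ruby, I'd recommend parsing the string with DateTime.strptime and treating a parse failure as invalid input:
require 'date'
def validate_date(date_str)
  valid_formats = ["%m/%d/%Y", "%m/%d/%Y %I:%M %P"]
  # See http://www.ruby-doc.org/core-1.9.3/Time.html#method-i-strftime for more directives.
  valid_formats.any? do |format|
    begin
      DateTime.strptime(date_str, format)
      true
    rescue ArgumentError
      # Raised when the string cannot be parsed with this format or names an impossible date.
      false
    end
  end
end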
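As a hedged sketch of how the same idea could cover every layout listed in the question, the snippet below tries several formats in turn. The names ACCEPTED_FORMATS and valid_us_datetime? are illustrative assumptions, not part of the answer above. It uses Date._strptime, which returns a hash of parsed fields (or nil), so that trailing text reported under :leftover can be rejected explicitly.

```ruby
require 'date'

# Hypothetical list of accepted layouts (an assumption for this sketch).
ACCEPTED_FORMATS = [
  "%m/%d/%Y",               # 9/1/2011
  "%m/%d/%Y %I:%M %p",      # 9/1/2011 10:00 am
  "%m/%d/%Y %I:%M:%S %p",   # 9/1/2011 10:00:00 am
  "%m/%d/%Y %H:%M:%S"       # 9/1/2011 22:00:00
].freeze

# Returns true when str fits one format completely and names a real calendar date.
def valid_us_datetime?(str)
  ACCEPTED_FORMATS.any? do |format|
    parts = Date._strptime(str, format)
    # nil means the format did not fit; :leftover means unparsed trailing text.
    next false if parts.nil? || parts.key?(:leftover)
    Date.valid_date?(parts[:year], parts[:mon], parts[:mday])
  end
end

puts valid_us_datetime?("9/1/2011 10:00 AM")
puts valid_us_datetime?("2/30/2011")
```

Checking the calendar date separately with Date.valid_date? keeps the format list short; the format strings only describe the layout.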

Well, here's what I ended up with, using stricter 24-hour (military) time:
DATE_TIME_FORMAT = /\A([01]?\d)\/([0-2]?\d|3[01])\/(199\d|[2-9]\d{3})\s([01]?\d|2[0-3]):([0-5]\d):([0-5]\d)\z/
Matches: 1/19/2011 23:59:59
Captures:
1
19
2011
23
59
59
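The captures listed above can be fed into a calendar check, since a regex alone will still accept dates such as 2/31/2011. The following is a minimal sketch under stated assumptions: SHAPE is a simplified, hypothetical pattern that only checks the layout, and parse_us_datetime is an illustrative helper name.

```ruby
require 'date'

# Hypothetical simplified pattern: checks the layout only, not the value ranges.
SHAPE = /\A(\d{1,2})\/(\d{1,2})\/(\d{4}) (\d{1,2}):(\d{2}):(\d{2})\z/

# Returns [year, month, day, hour, min, sec] or nil when the input is not a real date/time.
def parse_us_datetime(str)
  m = SHAPE.match(str)
  return nil unless m

  month, day, year, hour, min, sec = m.captures.map(&:to_i)
  return nil unless Date.valid_date?(year, month, day)
  return nil unless hour.between?(0, 23) && min.between?(0, 59) && sec.between?(0, 59)

  [year, month, day, hour, min, sec]
end

p parse_us_datetime("1/19/2011 23:59:59")
p parse_us_datetime("2/29/2011 10:00:00")
```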

if subject =~ /\A(?:0?[1-9]|1[012])\/(?:0?[1-9]|[12]\d|3[01])\/(?:\d{4})(?:\s+(?:(?:[01]?\d|2[0-3]):(?:[0-5]\d)|(?:0?\d|1[0-2]):(?:[0-5]\d)\s+[ap]m))?\s*\Z/i
  # Successful match
end
Good luck.
How it works:
"
\A # Assert position at the beginning of the string
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
[1-9] # Match a single character in the range between “1” and “9”
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
1 # Match the character “1” literally
[012] # Match a single character present in the list “012”
)
/ # Match the character “/” literally
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
[1-9] # Match a single character in the range between “1” and “9”
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
[12] # Match a single character present in the list “12”
\d # Match a single digit 0..9
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
3 # Match the character “3” literally
[01] # Match a single character present in the list “01”
)
/ # Match the character “/” literally
(?: # Match the regular expression below
\d # Match a single digit 0..9
{4} # Exactly 4 times
)
(?: # Match the regular expression below
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
[01] # Match a single character present in the list “01”
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
2 # Match the character “2” literally
[0-3] # Match a single character in the range between “0” and “3”
)
: # Match the character “:” literally
(?: # Match the regular expression below
[0-5] # Match a single character in the range between “0” and “5”
\d # Match a single digit 0..9
)
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
1 # Match the character “1” literally
[0-2] # Match a single character in the range between “0” and “2”
)
: # Match the character “:” literally
(?: # Match the regular expression below
[0-5] # Match a single character in the range between “0” and “5”
\d # Match a single digit 0..9
)
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
[ap] # Match a single character present in the list “ap”
m # Match the character “m” literally
)
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\Z # Assert position at the end of the string (or before the line break at the end of the string, if any)
"
Remember, it is not the spoon that bends, but you!
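A quick way to check the pattern against the question's sample strings is sketched below. ANSWER_PATTERN is the regex from this answer copied verbatim; WITH_SECONDS is a hypothetical variant (an assumption of this sketch, not the answerer's code) that adds an optional :ss part, because the pattern as posted has no seconds field.

```ruby
# The pattern from the answer above, unchanged.
ANSWER_PATTERN = /\A(?:0?[1-9]|1[012])\/(?:0?[1-9]|[12]\d|3[01])\/(?:\d{4})(?:\s+(?:(?:[01]?\d|2[0-3]):(?:[0-5]\d)|(?:0?\d|1[0-2]):(?:[0-5]\d)\s+[ap]m))?\s*\Z/i

# Hypothetical variant: same date part, optional seconds in both time styles.
WITH_SECONDS = /\A(?:0?[1-9]|1[012])\/(?:0?[1-9]|[12]\d|3[01])\/\d{4}(?:\s+(?:(?:0?[1-9]|1[0-2]):[0-5]\d(?::[0-5]\d)?\s+[ap]m|(?:[01]?\d|2[0-3]):[0-5]\d(?::[0-5]\d)?))?\s*\z/i

samples = ["9/1/2011", "9/1/2011 10:00 am", "9/1/2011 10:00 AM", "9/1/2011 10:00:00"]

samples.each do |s|
  puts "#{s.ljust(20)} answer=#{ANSWER_PATTERN.match?(s)} variant=#{WITH_SECONDS.match?(s)}"
end
```

Running it shows that the posted pattern accepts the first three samples but not the one with seconds, which the variant is meant to cover.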

Here's what I came up with that seems to work:
regex = /\A1?\d\/[123]?\d\/\d{4}(\s[12]?\d:[0-5]\d(:[0-5]\d)?(\s[ap]m)?)?\z/i

Related

Ruby regex get the word combo separated by period

I'm trying to use Ruby regex to get word combo like below.
In the example below I only need cases 1-4; I wrote the target word in caps for easy testing. The word in the middle (dbo, bcd) could be anything, or nothing at all as in case 3. I can't work out how to get the double-period case 3 working. It would also be good to match the standalone word SALES, but that may be too much for one regex. Thanks, all.
This is my script, which is only partially working; I still need it to match alpha..SALES.
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX'
regex = /alpha+\.+[a-z]+\.?sales/ix
puts 'R: ' + s.scan(regex).to_s
##R: ["alpha.dbo.SALES", "alpha.bcd.SALES"]
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX 7x alpha.abc.SALES.etc'
regex = /(?<=^|\s)(?:alpha\.[a-z]*\.)?(?:sales)(?=\s|$)/i
puts 'R: ' + s.scan(regex).to_s
Output:
R: ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
r = /
(?<=\d[ ]) # match a digit followed by a space in a positive lookbehind
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
| # or
\. # match a period
) # end non-capture group
)? # end non-capture group and optionally match it
SALES # match string
(?![.\p{Alpha}]) # do not match a period or letter (negative lookahead)
/x # free-spacing regex definition mode.
s.scan(r)
#=> ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
This regular expression is customarily written as follows.
r = /(?<=\d )(?:\p{Alpha}+\.(?:\p{Alpha}+\.|\.))?SALES(?![.\p{Alpha}])/
In free-spacing mode the space must be put in a character class ([ ]); else it would be stripped out.
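The effect of free-spacing mode on a bare space can be seen in a small sketch. The two patterns below are made up for illustration and are not taken from the answer.

```ruby
# In /x mode a bare space inside the pattern is ignored,
# so this is equivalent to /\dSALES/.
bare = /\d SALES/x

# Wrapping the space in a character class keeps it as a literal space.
in_class = /\d[ ]SALES/x

puts bare.match?("4 SALES")      # false: the pattern expects no space
puts bare.match?("4SALES")       # true
puts in_class.match?("4 SALES")  # true
```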

Regex: Match all hyphens or underscores not at the beginning or the end of the string

I am writing some code that needs to convert a string to camel case. However, I want to leave any _ or - at the beginning or end of the string untouched.
I have had success matching up an _ character using the regex here:
^(?!_)(\w+)_(\w+)(?<!_)$
when the inputs are:
pro_gamer #matched
#ignored
_proto
proto_
__proto
proto__
__proto__
#matched as nerd_godess_of, skyrim
nerd_godess_of_skyrim
I recursively apply my method on the first match if it looks like nerd_godess_of.
I am having trouble adding - matches to the same pattern. I assumed that just adding a - to the mix like this would work:
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
and it matches like this:
super-mario #matched
eslint-path #matched
eslint-global-path #NOT MATCHED.
I would like to understand why the regex fails to match the last case given that it worked correctly for the _.
The (almost) full set of test inputs can be found here
The regex
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
fails on "eslint-global-path" because it is anchored at both ends (^ and $) and allows only one explicit separator, so everything else in the string has to be matched by \w+. The regex reads, "Match the beginning of the line, not followed by a hyphen or underscore; then match one or more word characters (which include underscores) in capture group 1, a hyphen or underscore, and one or more word characters in capture group 2; lastly, match the end of the line, not preceded by a hyphen or underscore."
An underscore is a word (\w) character but a hyphen is not. That is why "nerd_godess_of_skyrim" matches (the extra underscores are absorbed by \w+) while a string with two hyphens cannot. In general, rather than using \w, you might want to use \p{Alpha} or \p{Alnum} (or POSIX [[:alpha:]] or [[:alnum:]]).
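The difference between the two separators can be checked directly. This short sketch reuses the pattern from the question; the constant name QUESTION_PATTERN is only for illustration.

```ruby
# \w covers letters, digits and underscore, but not the hyphen.
puts "_".match?(/\w/)  # true
puts "-".match?(/\w/)  # false

# The pattern from the question, unchanged.
QUESTION_PATTERN = /^(?![_-])(\w+)[_-](\w+)(?<![_-])$/

# Extra underscores are swallowed by the greedy first group.
p QUESTION_PATTERN.match("nerd_godess_of_skyrim").captures

# One hyphen works; two hyphens leave text that \w+ cannot cover.
p QUESTION_PATTERN.match("eslint-path").captures
p QUESTION_PATTERN.match("eslint-global-path")
```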
Try this.
r = /
(?<= # begin a positive lookbehind
[^_-] # match a character other than an underscore or hyphen
) # end positive lookbehind
( # begin capture group 1
(?: # begin a non-capture group
-+ # match one or more hyphens
| # or
_+ # match one or more underscores
) # end non-capture group
[^_-] # match any character other than an underscore or hyphen
) # end capture group 1
/x # free-spacing regex definition mode
'_cats_have--nine_lives--'.gsub(r) { |s| s[-1].upcase }
#=> "_catsHaveNineLives--"
This regex is conventionally written as follows.
r = /(?<=[^_-])((?:-+|_+)[^_-])/
If all the letters are lower case one could alternatively write
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/).
map(&:capitalize).join
#=> "_catsHaveNineLives--"
where
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/)
#=> ["_cats", "have", "nine", "lives--"]
(?=[^_-]) is a positive lookahead that requires the characters on which the split is made to be followed by a character other than an underscore or hyphen.
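Wrapped in a method, the gsub approach can be run against the inputs from the question. This is a minimal sketch; the names INNER_SEPARATOR and camelize_inner are assumptions made for the example.

```ruby
# One or more identical separators, preceded and followed by a non-separator.
INNER_SEPARATOR = /(?<=[^_-])((?:-+|_+)[^_-])/

# Upcases the character after each inner separator run and drops the run itself.
# Leading and trailing separators are left alone.
def camelize_inner(str)
  str.gsub(INNER_SEPARATOR) { |match| match[-1].upcase }
end

%w[pro_gamer _proto proto_ __proto__ eslint-global-path nerd_godess_of_skyrim].each do |input|
  puts "#{input} -> #{camelize_inner(input)}"
end
```

Because every inner separator is handled in a single gsub pass, no recursive application is needed.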
You can try the regex
^(?=[^-_])(\w+[-_]\w*)+(?=[^-_])\w$
See the demo here.
The character class is written [-_] rather than [_-]; a - placed first (or last) in a class is always a literal, never a range operator as in a-z.

Ruby parsing and regex

Picked up Ruby recently and have been fiddling around with it. I wanted to learn how to use regex or other Ruby tricks to check for certain words, whitespace characters, valid format etc in a given text line.
Let's say I have an order list that looks strictly like this in this format:
cost: 50 items: book,lamp
One space after each colon, no space after each comma, no trailing whitespace at the end, and so on.
How can I check for errors in this format using Ruby? This for example should fail my checks:
cost: 60 items:shoes,football
My plan was to split the string on " " and check that the first word is "cost:", that the second word is a number, and so on. But I realized that splitting on " " doesn't help me check for extra whitespace, as it just eats it up, and it doesn't help me check for trailing whitespace either. How do I go about doing this?
You could use the following regular expression.
r = /
\A # match beginning of string
cost:\s # match "cost:" followed by a space
\d+\s # match > 0 digits followed by a space
items:\s # match "items:" followed by a space
[[:alpha:]]+ # match > 0 lowercase or uppercase letters
(?:,[[:alpha:]]+) # match a comma followed by > 0 lowercase or uppercase
# letters in a non-capture group (?: ... )
* # perform the match on non-capture group >= 0 times
\z # match the end of the string
/x # free-spacing regex definition mode
"cost: 50 items: book,lamp" =~ r #=> 0 (a match, beginning at index 0)
"cost: 50 items: book,lamp,table" =~ r #=> 0 (a match, beginning at index 0)
"cost: 60 items:shoes,football" =~ r #=> nil (no match)
The regex can of course be written in the normal manner:
r = /\Acost:\s\d+\sitems:\s[[:alpha:]]+(?:,[[:alpha:]]+)*\z/
or
r = /\Acost: \d+ items: [[:alpha:]]+(?:,[[:alpha:]]+)*\z/
though a whitespace character (\s) cannot be replaced by a bare space in the free-spacing mode definition (/x), where unescaped whitespace is ignored.
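To show how the anchored pattern rejects the kinds of mistakes mentioned in the question, here is a small sketch. The constant ORDER_LINE and the helper valid_order_line? are hypothetical names; the pattern is the literal-space version given above.

```ruby
# Literal-space version of the pattern; \A and \z pin it to the whole string.
ORDER_LINE = /\Acost: \d+ items: [[:alpha:]]+(?:,[[:alpha:]]+)*\z/

def valid_order_line?(line)
  ORDER_LINE.match?(line)
end

[
  "cost: 50 items: book,lamp",      # well formed
  "cost: 60 items:shoes,football",  # missing space after "items:"
  "cost:  50 items: book,lamp",     # doubled space
  "cost: 50 items: book, lamp",     # space after comma
  "cost: 50 items: book,lamp "      # trailing space
].each do |line|
  puts "#{line.inspect} -> #{valid_order_line?(line)}"
end
```

Using \z rather than $ also rejects a line that still carries its trailing newline, so input read from a file would need chomp first.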

extracting data through regexps is returning nil

I'm trying to extract a pair of strings from a parsed PDF and I have this extract:
Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo
http://rubular.com/r/GRI6j4Byz3
My goal is to get out the 731 and 1823750212 values.
I tried something like text[/Number:(.*)Articolo/] for the first steps but it's returning nil while on rubular it somewhat matches.
Any tips?
If the format of the string is fixed (the dates and the long number), this will do the trick:
text.scan(/\ANumber:(\d+).*?(\d{5,})/)
#⇒ [[ "731", "1823750212" ]]
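One plausible explanation for the nil in the original attempt, assuming text holds real newline characters as in the excerpt: in Ruby, . does not match a newline unless the /m flag is set, so (.*) can never reach "Articolo" on a later line. A short sketch:

```ruby
text = "Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"

# Without /m the dot stops at the first newline, so the match fails.
p text[/Number:(.*)Articolo/]

# With /m the dot also matches newlines; the second argument selects group 1.
p text[/Number:(.*)Articolo/m, 1]
```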
I have assumed that we do not know the length of either string (representations of non-negative integers) to be extracted, only that the first follows "Number:", which is at the beginning of the string, and the second is preceded and followed by at least one space.
r = /
(?<=\ANumber:) # match beginning of string followed by 'Number:' in a
# positive lookbehind
\d+ # match one or more digits
| # or
(?<=\s) # match a whitespace char in a positive lookbehind
\d+ # match one or more digits
(?=\s) # match a whitespace char in a positive lookahead
/x # free-spacing regex definition mode
str = "Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"
str.scan(r)
#=> ["731", "1823750212"]
If there could be intervening spaces between the colon and "731", you could modify the regex as follows.
r = /
\A # match beginning of string
Number: # match string 'Number:'
\s* # match zero or more whitespace characters
\K # forget everything matched so far
\d+ # match one or more digits
| # or
(?<=\s) # match a whitespace char in a positive lookbehind
\d+ # match one or more digits
(?=\s) # match a whitespace char in a positive lookahead
/x # free-spacing regex definition mode
str = "Number: 731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"
str.scan(r)
#=> ["731", "1823750212"]
Here \K must be used because Ruby does not support variable-length positive lookbehinds.
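As a small illustration of \K on its own, the sketch below pulls the first number out of strings with different amounts of whitespace after the colon. The constant name and the sample strings are made up for the example.

```ruby
# Everything before \K must match but is dropped from the reported result.
NUMBER_AFTER_LABEL = /\ANumber:\s*\K\d+/

samples = ["Number:731 / 13/06/2016", "Number: 731 / 13/06/2016", "Number:\t  731 / 13/06/2016"]

samples.each do |s|
  puts s[NUMBER_AFTER_LABEL]
end
```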

Ruby regex extracting words

I'm currently struggling to come up with a regex that can split up a string into words where words are defined as a sequence of characters surrounded by whitespace, or enclosed between double quotes. I'm using String#scan
For instance, the string:
' hello "my name" is "Tom"'
should match the words:
hello
my name
is
Tom
I managed to match the words enclosed in double quotes by using:
/"([^\"]*)"/
but I can't figure out how to incorporate the surrounded by whitespace characters to get 'hello', 'is', and 'Tom' while at the same time not screw up 'my name'.
Any help with this would be appreciated!
result = ' hello "my name" is "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
will work for you. It will print
=> ["", "hello", "\"my name\"", "is", "\"Tom\""]
Just ignore the empty strings.
Explanation
"
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
[^"] # Match any character that is NOT a “"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
" # Match the character “"” literally
[^"] # Match any character that is NOT a “"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
" # Match the character “"” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^"] # Match any character that is NOT a “"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character)
)
"
You can use reject like this to avoid empty strings
result = ' hello "my name" is "Tom"'
.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}
prints
=> ["hello", "\"my name\"", "is", "\"Tom\""]
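The question asks for the words without their surrounding quotes, so one possible follow-up step is to strip the quotes after splitting. This is a sketch only; split_words and QUOTE_AWARE_SPACE are hypothetical names, and it assumes quotes in the input are balanced.

```ruby
# Whitespace that is followed by an even number of double quotes,
# i.e. whitespace that is not inside a quoted phrase.
QUOTE_AWARE_SPACE = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/

def split_words(text)
  text.split(QUOTE_AWARE_SPACE)
      .reject(&:empty?)              # drop the empty string from leading whitespace
      .map { |word| word.delete('"') } # remove the surrounding quotes
end

p split_words(' hello "my name" is "Tom"')
```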
text = ' hello "my name" is "Tom"'
text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}
Produces:
hello
my name
is
Tom
Explanation:
0 or more spaces followed by
either
some words within double-quotes OR
a single word
followed by 0 or more spaces
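A variation on the same idea, offered as a sketch rather than as the answerer's code, uses one capture group per alternative so that the result can be collected with flatten and compact instead of a block. The method name words_from is an assumption for this example.

```ruby
# Group 1 captures the inside of a quoted phrase, group 2 a bare word.
# For each match exactly one of the two groups is non-nil.
def words_from(text)
  text.scan(/"([^"]*)"|(\S+)/).flatten.compact
end

p words_from(' hello "my name" is "Tom"')
```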
You can try this regex:
/\b(\w+)\b/
which uses \b to match at word boundaries. The website http://rubular.com/ is helpful for testing.
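Running that pattern on the sample string shows what it returns. Note that it treats every run of word characters as a separate word, so the quoted phrase is split in two.

```ruby
text = ' hello "my name" is "Tom"'

# scan returns one single-element array per match; flatten gives plain strings.
words = text.scan(/\b(\w+)\b/).flatten
p words
```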

Resources