Negative lookahead to ignore boilerplate text in an email thread - ruby

I'm trying to write a negative lookahead for a regular expression that will ignore lines of boilerplate in the text of an email, specifically the bit that goes like:
> On Sat, Apr 27, 2013 at 11:39 PM, Jane Smith <jane.smith#example.com> wrote:
I want to match all the digits that are not in my negative lookahead. I tried this:
(?!(?:^>?*\sOn\s.*wrote:\s?)$)\d
But that always matches inside that line. I'm particularly confused because this regex:
(?:^>?*\sOn\s.*wrote:\s?)$
matches that entire line. Obviously I'm missing something, but I have no idea what it is. Thanks for any help.

Try this pattern, but don't forget to remove the empty matches:
> On .*+\n> wrote:|(\d++)
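A likely reason the original attempt fails: (?!...) is only tested at the position where \d is about to match, and the header never starts at a digit in the middle of a line, so the negative assertion always succeeds. Below is a minimal Ruby sketch of the alternation idea, assuming the quote header fits on a single line as in the question. The names HEADER and digits_outside_headers and the sample text in the usage note are made up for illustration.

```ruby
# Hypothetical pattern for a one-line quote header such as
# "> On Sat, Apr 27, 2013 at 11:39 PM, Jane Smith <...> wrote:"
HEADER = /^>?[ \t]*On\s.*wrote:[ \t]*$/

# Try to consume a whole header line first (capturing nothing); otherwise
# capture a run of digits. scan yields [nil] for header matches, so
# flatten.compact leaves only the digit runs.
def digits_outside_headers(text)
  text.scan(/#{HEADER}|(\d+)/).flatten.compact
end

sample = <<~'MAIL'
  Order 42 shipped
  > On Sat, Apr 27, 2013 at 11:39 PM, Jane Smith <jane.smith#example.com> wrote:
  > Please call 555 1234
MAIL

p digits_outside_headers(sample)
```

Because the header alternative comes first and swallows the whole line, the digits inside it are never offered to the second alternative.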

Related

Regex ruby syntax to select a number while excluding specific ones

I am still struggling to find some Ruby regex syntax despite the numerous documentation online. I have an array of strings and I am looking for strings that include a number (whatever the number of digits) but not specific ones (let's say, for instance, dates from 19XX to 201X).
I managed to get the regex for "the line contains a number"
.*\p{N}.*
I managed to get "exclude the line if this number is a year"
(?!19\d\d|20[0-1]\d)\d{4}
But I fail to combine both. I would need something that would intuitively be written as such
(.*\p{N}.*)&&(?!19\d\d|20[0-1]\d)\d{4}
But I am not sure how an AND operator can be used.
Here it is:
^(?!.*19\d\d.*)(?!.*20[01]\d.*)(.*\p{N}.*)$
You want a string that:
(?!.*19\d\d.*) doesn't contain 19xx
(?!.*20[01]\d.*) doesn't contain 200x or 201x
(.*\p{N}.*) contains at least one digit
In a regex, && means a literal &&, not an AND operator.
If you want to capture numbers that are not in the range 1900-2019, you can replace it with:
(?!\b19\d\d\b)(?!\b20[01]\d\b)(\b\p{N}+\b)
You can test it here
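For reference, here is a small Ruby sketch applying both patterns from this answer. The method names and the sample data are invented for illustration; only the two regexes come from the answer.

```ruby
# Line filter: no 19xx, no 200x/201x, and at least one number character.
NO_YEAR_LINE = /^(?!.*19\d\d.*)(?!.*20[01]\d.*)(.*\p{N}.*)$/

# Token-level variant: capture whole numbers outside 1900-2019.
NON_YEAR_NUMBER = /(?!\b19\d\d\b)(?!\b20[01]\d\b)(\b\p{N}+\b)/

# Keep only the strings that pass the line filter.
def lines_without_years(lines)
  lines.grep(NO_YEAR_LINE)
end

# Return every number in the text that is not a year in the excluded range.
def non_year_numbers(text)
  text.scan(NON_YEAR_NUMBER).flatten
end

p lines_without_years(["born in 1984", "room 42", "no digits here", "released 2014", "id 7"])
p non_year_numbers("Here 2014 and 1945 and 1878 and 20000 and 2")
```

Note that the line filter rejects any string containing a year-like substring, so a longer number such as 20145 would also exclude its line; the token-level variant uses \b to avoid that.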
While the solution by Thomas is probably the best one, another option would be to go without negation and just select everything that matches:
re = /\D(
[03-9]\d*|
(?:1|2|20)(?=\D)|
1[0-8]\d*|
19\d?(?=\D)|
19\d{3,}|
20[2-9]\d*|
20[01]?(?=\D)|
20[01]\d{2,}
)/x
▶ 'Here 2014 and 1945 and 1878 and 20000 and 2 and 19 and 195 and 203'.scan re
#⇒ [["1878"], ["20000"], ["2"], ["19"], ["195"], ["203"]]

Regex matching of incomplete strings "as-you-type"

What would be a good practice to match strings while the user is typing them in?
I would like to take this regex for matching a time as an example:
/(\d\d?)(:\d\d?)?\s?(am|pm)?/ matches 12:30 am, 9 pm, 23:00 perfectly fine.
It only becomes tricky when you want to provide feedback while the user is typing in that string.
12: is not matched, neither would 12:30 p.
My solution would be to use a second, independent regex for matching incomplete strings, one that includes all input possibilities:
/(\d\d?)(:\d\d?|:)?\s?(|a|p|am|pm)?/ will match 12: and 12:30 p just fine.
Is there a better, more elegant way, to do this?
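The two-regex approach described in this question could be wired up roughly as follows in Ruby. This is only a sketch: the \A and \z anchors are an added assumption (so the whole input must match), and COMPLETE, PARTIAL, and input_state are hypothetical names.

```ruby
# The "finished" pattern from the question, anchored to the whole input.
COMPLETE = /\A(\d\d?)(:\d\d?)?\s?(am|pm)?\z/

# The looser pattern from the question that also accepts in-progress input.
PARTIAL = /\A(\d\d?)(:\d\d?|:)?\s?(|a|p|am|pm)?\z/

# Classify what the user has typed so far.
def input_state(text)
  return :complete if COMPLETE.match?(text)
  return :partial if PARTIAL.match?(text)
  :invalid
end

["12:30 am", "9 pm", "23:00", "12:", "12:30 p", "abc"].each do |typed|
  puts "#{typed.inspect} is #{input_state(typed)}"
end
```

Checking the strict pattern first means the looser one is only consulted for input that is not yet a full time.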

Regular expression match that excludes characters inside parenthesis

I have the following types of strings.
BILL SMITH (USA)
WINTHROP (FR)
LORD AT WAR (GB)
KIM SMITH
With these strings, I have the following constraints:
1. all caps
2. can be 2 to 18 characters long
3. should not have any white spaces or carriage returns at the end
4. the country abbreviation inside the parens should be excluded
5. some of the names will not have the country in parens and they should be matched too
After applying my regular expression I'd like to get the following:
BILL SMITH (USA) => BILL SMITH
WINTHROP (FR) => WINTHROP
LORD AT WAR (GB) => LORD AT WAR
KIM SMITH => KIM SMITH
I came up with the following regular expression but I'm not getting any matches:
String.scan(\([A-Z \s*]{1,18})(^?!(\([A-Z]{1,3}\)))\)
I've been banging my head on this for a while, so if anyone can point out the error I'd appreciate it.
UPDATE:
I've gotten some great responses; however, so far none of the regular expression solutions have met all the constraints. The tricky part seems to be that some of the strings have the country in parentheses and some don't. In one case, strings without the country were not matched, and in another the regex returned the correct string along with the country abbreviation without the parentheses. (See the comments on the second response.) One point of clarification: the part I want to match always begins at the start of the string. Not sure if that helps or not. Thanks again for all your help.
Here's one solution:
^((?:[A-Z]|\s){2,18}+?)(?:\s\([A-Z]+\))?$
See it on Rubular. Note that it counts 18 characters before the parenthesis; I'm not sure how you want it to behave specifically. If you want to make sure the whole line isn't more than 18 characters, I suggest you just do unless line.length <= 18 ... Similarly, if you want to make sure there is no whitespace at the end, I recommend you use line.strip. That'll greatly reduce the complexity of the Regexp you need and make your code more readable.
Edit: also works when no parentheses are used after the name.
The biggest error is that you wrote (^?!...) where you meant (?=...). The former means "an optional start-of-line anchor, followed by !, followed by ..., inside a capture group"; the latter means "a position in the string that is followed by ...". Fixing that, as well as making a few other tweaks, and adding the requirement that the initial string end with a letter, we get:
([A-Z\s]{1,17}[A-Z])(?=\s*\([A-Z]{1,3}\))
Update based on OP comments: Since this will always match at the start of a string, you can use \A to "anchor" your pattern to the start of the string. You can then get rid of the lookahead assertion. This:
\A[A-Z][A-Z\s]{0,16}[A-Z]
matches start-of-string, followed by an uppercase letter, followed by up to 16 characters that are either uppercase letters or whitespace characters, followed by an uppercase letter.
You can also just use gsub to remove the part(s) you don't want. To remove everything in parentheses you could do:
str.gsub(/\s*\([^)]*\)/, '')
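As a quick check, the anchored pattern and the gsub alternative can both be run against the sample strings from the question. This is a sketch; NAME, name_only, and strip_parens are illustrative names, not part of the answers above.

```ruby
# Anchored pattern from the answer: starts and ends with an uppercase letter,
# 2 to 18 characters in total, letters and whitespace in between.
NAME = /\A[A-Z][A-Z\s]{0,16}[A-Z]/

# Return the matched name, or nil when the string does not start with one.
def name_only(str)
  str[NAME]
end

# Alternative: delete any parenthesised part together with leading whitespace.
def strip_parens(str)
  str.gsub(/\s*\([^)]*\)/, '')
end

["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"].each do |s|
  puts "#{s} => #{name_only(s)}"
end
```

The two approaches differ in what they enforce: the pattern applies the all-caps and length constraints, while gsub only removes the parenthesised suffix.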

Regex to match a String with optional Conditions [duplicate]

Possible Duplicate:
How do I make part of a regular expression optional in Ruby?
I'm trying to build a regular expression with rubular to match:
On Feb 23, 2011, at 10:22 , James Bond wrote:
OR
On Feb 23, 2011, at 10:22 AM , James Bond wrote:
Here's what I have so far, but for some reason it's not matching? Ideas?
(On.* (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}.* at \d{1,2}:\d{1,2} (?:AM|PM),.*wrote:)
How can I make the AM/PM text optional? Either match AM/PM or neither?
This seems to catch the date info. I purposely captured in groups, making it easier to build a real date:
regex = /^On (\w+ \d+, \d+), \w+ (\S+) (\w*)\s*,/
[
  'On Feb 23, 2011, at 10:22 , James Bond wrote:',
  'On Feb 23, 2011, at 10:22 AM , James Bond wrote:'
].each do |line|
  line =~ regex
  puts "#{$1} #{$2} #{$3}"
end
# >> Feb 23, 2011 10:22
# >> Feb 23, 2011 10:22 AM
I purposely didn't try to match on the months. Your sample strings look like quote headers from email messages. Those are very standard and generated by software, so you should see a lot of consistency in the format, allowing some simplification in the regex. If you can't trust those, then go with the matches on month name abbreviations to help ignore false-positive matches. The same things apply for the day, year, and time values.
The important thing in the regex is how to deal with the AM/PM when it's missing.
Maybe this:
(On\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},\s+[12]\d{3},\s+at\s+\d{1,2}:\d{1,2}\s+(?:AM|PM)*,.*wrote:)
However, if you can verify that only these lines have this shape, you don't need such a detailed regex. If the line always starts with "On" and ends with "wrote:", your regex might simply be /^On.*wrote:/
Just use the question mark operator after any group you want to be optional, so in this case:
(?:(?:AM|PM) )?
Be sure to match the space as well, otherwise the strings without AM/PM need to include two spaces. The solution with (?:AM|PM)* would also match AMAMPM, so that's probably not what you want. But why do you match those groups without creating backreferences? Aren't you going to use the values?
For info on backreferences:
http://www.regular-expressions.info/brackets.html
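To illustrate the optional group, here is a sketch of a full pattern built around (?:(?:AM|PM) )? and checked against the two sample lines from the question. The surrounding pattern is an assumption adapted from the asker's attempt, and QUOTE_HEADER and quote_header? are hypothetical names.

```ruby
# The optional group consumes "AM " or "PM " including the trailing space,
# so both "10:22 ," and "10:22 AM ," are accepted.
QUOTE_HEADER = /On (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}, at \d{1,2}:\d{2} (?:(?:AM|PM) )?,.*wrote:/

# True when the line looks like one of the quote headers from the question.
def quote_header?(line)
  QUOTE_HEADER.match?(line)
end

[
  'On Feb 23, 2011, at 10:22 , James Bond wrote:',
  'On Feb 23, 2011, at 10:22 AM , James Bond wrote:'
].each do |line|
  puts "#{quote_header?(line)}: #{line}"
end
```

Unlike (?:AM|PM)*, the ? quantifier allows the group at most once, so a string such as "AMPM" in that position is rejected.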

regex for matching german postal codes but not a

following string:
23434 5465434
58495 / 46949345
58495 - 46949345
58495 / 55643
d 44444 ssdfsdf
64784
45643 dfgh
58495/55643
48593/48309596
675643235
34565435 34545
I only want to extract the bold ones. It's a five-digit number (German).
It should not match telephone numbers such as 43564 366334 or 45433 / 45663, as in my example above.
I tried something like ^\b\d{5}, but that's not a good beginning.
Any hints to get this working?
Thanks for all hints.
You could add a negative look-ahead assertion to avoid the matches with phone numbers.
\b[0124678][0-9]{4}\b(?!\s?[ \/-]\s?[0-9]+)
If you're using Ruby 1.9, you can add a negative look-behind assertion as well.
You haven't specified what distinguishes the number you're trying to search for.
Based on the example string you gave, it looks like you just want:
^(\d{5})\n
Which matches lines that start with 5 digits and contain nothing else.
You might want to permit some spaces after the first 5 digits (but nothing else):
^(\d{5})\s*\n
I'm not completely sure about the specified rules. But if you want lines that start with 5 digits and do not contain additional digits, this may work:
^(\d{5})[^\d]*$
If leading white space is okay, then:
^\s*(\d{5})[^\d]*$
Here is the Rubular link that shows the result.
^\D*(\d{5})(\s(\D)*$|()$)
This should (it's untested) match:
a line starting with five digits (or some non-digits and then five digits), then a space, and ending with some non-digits
a line starting and ending with five digits (or some non-digits and then five digits)
\1 would be the five digits
\2 would be the whole second half, if any
\3 would be the last non-digit character after the digits, if any, because (\D)* keeps only its final repetition
edited to fit the asker's edited question
edit again: I came up with a much more elegant solution:
^\D*(\d{5})\D*$
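A short Ruby sketch of this last pattern, applied line by line to the sample data from the question. The names POSTAL_LINE and five_digit_codes are invented for illustration, and the input is assumed to be already split into lines, since \D would otherwise also match newlines.

```ruby
# Exactly one run of five digits on the line, with only non-digits around it.
POSTAL_LINE = /^\D*(\d{5})\D*$/

# Return the five-digit code from each line that has one and no other digits.
def five_digit_codes(lines)
  lines.map { |line| line[POSTAL_LINE, 1] }.compact
end

sample = [
  "23434 5465434", "58495 / 46949345", "58495 - 46949345", "58495 / 55643",
  "d 44444 ssdfsdf", "64784", "45643 dfgh", "58495/55643", "48593/48309596",
  "675643235", "34565435 34545"
]

p five_digit_codes(sample)
```

Lines that contain a second group of digits, or a longer digit run, produce nil and are dropped by compact.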

Resources