How do I regex a name and an email out of the 3 major email clients in ruby? - ruby

I thought I had it figured out, but it appears that my regex still has quirks in it. Basically I would like to use the same regex pattern to match the following major email clients (Gmail, Yahoo, and regular email):
"Brian Mang" <brian.mang#email.com> -- Case1
Brian Mang (brian.mang#email.com) -- Case2
<brian.mang#email.com> -- Case3
brian.mang#email.com -- Case4
I had the following regex pattern:
/[\W"]*(?<name>.*?)[\"]*?\s*[<(](?<email>\w.*)[>)]/.match(contact)
and it works for Cases 1-3, but I can't get it to pick up Case 4. I tried adjusting it, but every change breaks the other cases. Any idea what I need to modify so that my regex picks up all four cases? Thank you.

Try this
[\W"]*(?<name>.*?)[\"]*?\s*[<(]?(?<email>\S+#\S+)[>)]?
I made the classes surrounding the address optional and changed the part that matches the email to \S+#\S+, which means at least one non-whitespace character, followed by a #, then at least one more non-whitespace character.
Since the version above also captures the closing character, you can restrict the part after the # a bit more:
[\W"]*(?<name>.*?)[\"]*?\s*[<(]?(?<email>\S+#[^\s>)]+)[>)]?

Edit: This one works for all four:
[\W"]*(?<name>.*?)[\"]*?\s*[<(]?(?<email>\S+#[^)>]+)[>)]?
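To check the final pattern against the four cases, a minimal Ruby sketch follows. The constant CONTACT_RE and the helper parse_contact are hypothetical names introduced for illustration; the sample strings keep the # in place of the usual address separator, exactly as in the question.

```ruby
# Pattern from the edit above, unchanged.
CONTACT_RE = /[\W"]*(?<name>.*?)[\"]*?\s*[<(]?(?<email>\S+#[^)>]+)[>)]?/

# Hypothetical helper: returns [name, email], or nil when nothing matches.
def parse_contact(contact)
  m = CONTACT_RE.match(contact)
  m && [m[:name], m[:email]]
end

[
  '"Brian Mang" <brian.mang#email.com>', # Case 1
  'Brian Mang (brian.mang#email.com)',   # Case 2
  '<brian.mang#email.com>',              # Case 3
  'brian.mang#email.com'                 # Case 4
].each { |contact| p parse_contact(contact) }
```

Cases 3 and 4 yield an empty string for the name. Because [^)>]+ also accepts spaces, any text that follows a bare address on the same line would be captured as part of the email, so the input is assumed to contain nothing after the address.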

Related

Regex for Git commit message

I'm trying to come up with a regex for enforcing Git commit messages to match a certain format. I've been banging my head against the keyboard modifying the semi-working version I have, but I just can't get it to work exactly as I want. Here's what I have now:
/^([a-z]{2,4}-[\d]{2,5}[, \n]{1,2})+\n{1}^[\w\n\s\*\-\.\:\'\,]+/i
Here's the text I'm trying to enforce:
AB-1432, ABC-435, ABCD-42
Here is the multiline description, following a blank
line after the Jira issue IDs
- Maybe bullet points, with either dashes
* Or asterisks
Currently, it matches that, but it will also match if there's no blank line after the issue IDs, or if there are multiple blank lines after them.
Is there any way to enforce that, or will I just have to live with it?
It's also pretty ugly, I'm sure there's a more succinct way to write that out.
Thanks.
Your regex allows for \n as one of the possible characters after the required newline, so that's why it matches when there are multiple.
Here's a cleaned up regex:
/^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+/i
Notes:
This requires at least one [-\w\s*.:',] character before the next newline.
I changed the issue IDs to have one possible comma, space, and newline, in that order (up to one of each). Can you use lookaheads? If so, I added (?=[, \n]) to make sure the issue ID is followed by at least one of those characters.
Also notice that many of the characters don't need to be escaped in a character class.
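A quick way to try the cleaned-up regex is a small Ruby sketch like the one below. The constant names and the sample messages other than the one from the question are assumptions made for illustration. One thing worth checking in your own setup: \s inside the character class also matches a newline, so a message with extra blank lines after the IDs still passes.

```ruby
# Cleaned-up pattern from the answer above, unchanged.
COMMIT_RE = /^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+/i

# The message from the question (ends with a newline).
GOOD_MSG = <<~MSG
  AB-1432, ABC-435, ABCD-42

  Here is the multiline description, following a blank
  line after the Jira issue IDs
  - Maybe bullet points, with either dashes
  * Or asterisks
MSG

# Hypothetical counter-examples.
NO_BLANK_MSG  = "AB-1432, ABC-435\nDescription without a blank line\n"
TWO_BLANK_MSG = "AB-1432\n\n\nDescription after two blank lines\n"

p COMMIT_RE.match?(GOOD_MSG)      # => true
p COMMIT_RE.match?(NO_BLANK_MSG)  # => false
p COMMIT_RE.match?(TWO_BLANK_MSG) # => true, because \s matches the extra "\n"
```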

Regex for matching everything before trailing slash, or first question mark?

I'm trying to come up with a regex that will elegantly match everything in a URL AFTER the domain name, and before the first ?, the last slash, or the end of the URL, if neither of the two exists.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
In short, regexes are an incredibly powerful tool, but when you're dealing with things defined by convoluted, hundred-page standards, and there is already a library that does the job faster, more easily, and more correctly, why reinvent the wheel?
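Note that URI#path keeps the leading slash, and the trailing one when present, so the result needs a small trim to match the output the question asks for. A minimal sketch, assuming a hypothetical helper named path_without_slashes:

```ruby
require 'uri'

# Hypothetical helper: the URL path with its leading slash and any
# trailing slash removed. The query string is never part of URI#path.
def path_without_slashes(url)
  URI(url).path.gsub(%r{\A/|/\z}, '')
end

base = 'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price'
puts path_without_slashes(base + '/')
puts path_without_slashes(base + '?id=2')
puts path_without_slashes(base)
# Each call prints:
# 2013/07/31/a-new-health-care-approach-dont-hide-the-price
```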
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
(?=) is just a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
You can probably fix up the prefix; I just assumed that you wanted the match to start at 2XXX, where X is a digit.
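The three-way alternation can be exercised in Ruby with a sketch like this. PATH_RE and path_via_regex are hypothetical names; the pattern itself is the one above, written with %r{} so the slash needs no escaping.

```ruby
# Lookahead-based pattern from this answer, unchanged.
PATH_RE = %r{((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)}

# Hypothetical helper: returns the matched part of the URL, or nil.
def path_via_regex(url)
  url[PATH_RE]
end

base = 'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price'
puts path_via_regex(base + '/')     # third branch: up to the last word character
puts path_via_regex(base + '?id=2') # first branch: stops before "?id"
puts path_via_regex(base)           # third branch again
```

All three calls print 2013/07/31/a-new-health-care-approach-dont-hide-the-price. The match depends on the path starting with a year of the form 2XXX, so URLs without such a segment return nil.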
Also, save the pitchforks, everyone: regular expressions are not always the best tool, but they are there for you when you need them.
Also, there is an xkcd (https://xkcd.com/208/) for everything.

Regex Tag-Within-Tag

I have a fairly simple regex problem for a little personal experiment that I haven't quite figured out.
In a string, I might have several <tag>[some characters here] that I need to match. The obvious way to do it would be with a /<tag>\[.*?\]/ regex, to match any characters after the <tag>[ and before the ].
I'd like to be able to have <tag>s within <tag>s, however. This causes a problem. If I had the following:
<tag>[some characters <tag>[in here] to match]
the regex would stop matching as soon as it reached the first closing-bracket, and completely fail to match the last part of the statement. I've tried to solve the problem by telling the regex to ignore any internal <tag>s, so I can do a match on the stripped contents later. I haven't quite gotten it working. The closest I've come is:
/<tag>\[(.*?(?:<tag>\[.*?\])*?.*?)\]/
which doesn't quite work. I would hope that it would match any number of characters, and any inner tags if they exist. It still has trouble with that first closing bracket, however.
Maybe somebody who's better at regular expressions knows a good solution to this.
Though you should probably drop regex and do this manually if the mini-language becomes more complex, you can use recursive regex.
Your regex would look something like this:
/(?<reg>(\w+\[([^\]\[]|\g<reg>)*\]))/
You can see it in action here: http://rubular.com/r/9F7isgZpj9
Here is the regex broken down to its parts:
(?<reg>( # start a group named "reg"
\w+ # the tag name
\[ # open bracket
( # which can contain
[^\]\[] # non-bracket characters
| # or
\g<reg> # sub-tags (this is where the magic happens)
)* # zero or more times
\] # close the tag
)
)
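As a sketch of the recursion in Ruby: the pattern above expects the tag name directly before the bracket (tag[...]), so it does not match the literal <tag>[...] form from the question. The second constant below is an assumed adaptation for that form, not part of the original answer, and both constant names are hypothetical.

```ruby
# Pattern from the answer: a word (\w+) immediately followed by "[".
NAME_BRACKET_RE = /(?<reg>(\w+\[([^\]\[]|\g<reg>)*\]))/

# Assumed adaptation for the question's literal "<tag>[" prefix.
ANGLE_TAG_RE = /(?<reg><tag>\[(?:[^\[\]]|\g<reg>)*\])/

plain  = 'tag[some characters tag[in here] to match]'
angled = '<tag>[some characters <tag>[in here] to match]'

puts plain[NAME_BRACKET_RE]        # whole string, inner tag included
puts angled[ANGLE_TAG_RE]          # whole string, inner tag included
p    angled[NAME_BRACKET_RE]       # => nil, since ">" sits between name and "["
```

In both patterns \g<reg> re-enters the named group, which is how the inner tag and its closing bracket are consumed before the outer closing bracket is matched.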

how to use regex negation string

Can anybody tell me how to use a regex to negate a string?
I want to find all lines that start with public class, followed by anything except first or second, and then anything else.
For example, in the result I expect to see public class base but not public class myfirst:base.
Can anybody help me, please?
Use a negative lookahead:
public\s+class\s+(?!first|second).+
If Peter is correct and you're using Visual Studio's Find feature, this should work:
^:b*public:b+class:b+~(first|second):i.*$
:b matches a space or tab
~(...) is how VS does a negative lookahead
:i matches a C/C++ identifier
The rest is standard regex syntax:
^ for beginning of line
$ for end of line
. for any character
* for zero or more
+ for one or more
| for alternation
Both the other two answers come close, but probably fail for different reasons.
public\s+class\s+(?:(?!first|second).)+
Note how there is a (non-capturing) group around the negative lookahead, to ensure it applies to more than just the first position.
And that group is less restrictive - since . excludes newline, it's using that instead of \S, and the $ is not necessary - this will exclude the specified words and match others.
No slashes wrapping the expression since those aren't required in everything and may confuse people that have only encountered string-based regex use.
If this still fails, post the exact content that is wrongly matched or missed, and what language/ide you are using.
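For engines with standard lookahead syntax, the difference between the two patterns can be seen in a short Ruby sketch. The constant names are hypothetical, and the ^ and $ anchors on the second pattern are an addition: without $, the per-character lookahead can still match a prefix such as "public class my".

```ruby
# Single lookahead: only rejects names that START with first/second.
SIMPLE_RE = /public\s+class\s+(?!first|second).+/

# Lookahead applied at every position, anchored to the whole line
# (the anchors are an assumption added for this sketch).
TEMPERED_RE = /^public\s+class\s+(?:(?!first|second).)+$/

p SIMPLE_RE.match?('public class first')          # => false
p SIMPLE_RE.match?('public class myfirst:base')   # => true
p TEMPERED_RE.match?('public class base')         # => true
p TEMPERED_RE.match?('public class myfirst:base') # => false
```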
Update:
Turns out you're using Visual Studio, which has its own special regex implementation, for some unfathomable reason. So, you'll be wanting to try this instead:
public:b+class:b+~(first|second)+$
I have no way of testing that - if it doesn't work, try dropping the $, but otherwise you'll have to find a VS user. Or better still, the VS engineer(s) responsible for this stupid non-standard regex.
Here is something that should work for you
/public\sclass\s(?:[^fs\s]+|(?!first|second)\S)+(?=\s|$)/
The second lookahead could be changed to a $ (end of line) or another anchor that works for your particular use case, such as a '{'.
Edit: Try changing the last part to:
(?=\s|$)

regex to match trailing whitespace, but not lines which are entirely whitespace (indent placeholders)

I've been trying to construct a ruby regex which matches trailing spaces - but not indentation placeholders - so I can gsub them out.
I had this /\b[\t ]+$/ and it was working a treat until I realised it only works when the line ends are [a-zA-Z]. :-( So I evolved it into this /(?!^[\t ]+)[\t ]+$/ and it seems like it's getting better, but it still doesn't work properly. I've spent hours trying to get this to work to no avail. Please help.
Here's some text test so it's easy to throw into Rubular, but the indent lines are getting stripped so it'll need a few spaces and/or tabs. Once lines 3 & 4 have spaces back in, it shouldn't match on lines 3-5, 7, 9.
some test test
some test test
some other test (text)
some other test (text)
likely here{ dfdf }
likely here{ dfdf }
and this ;
and this ;
Alternatively, is there a simpler / more elegant way to do this?
If you're using 1.9, you can use look-behind:
/(?<=\S)[\t ]+$/
but unfortunately, it's not supported in older versions of ruby, so you'll have to handle the captured character:
str.gsub(/(\S)[\t ]+$/) { $1 }
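Both variants can be compared on a small sample. This is a minimal sketch in which the sample text and the two helper names are assumptions; the second line of the sample consists only of spaces and stands in for an indent placeholder.

```ruby
# Hypothetical sample: trailing spaces, an indent-only line, trailing tab + space.
SAMPLE = "some test  \n    \nlikely here{ dfdf }\t \n"

# Look-behind variant (Ruby 1.9 and later).
def strip_trailing(text)
  text.gsub(/(?<=\S)[\t ]+$/, '')
end

# Capture variant for engines without look-behind: put the captured
# non-whitespace character back.
def strip_trailing_capture(text)
  text.gsub(/(\S)[\t ]+$/) { $1 }
end

p strip_trailing(SAMPLE)         # => "some test\n    \nlikely here{ dfdf }\n"
p strip_trailing_capture(SAMPLE) # => "some test\n    \nlikely here{ dfdf }\n"
```

The indent-only line survives in both cases because the character before its first space is a newline, which \S does not match.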
Your first expression is close, and you just need to change the \b to a negated character class. This should work better:
/([^\t ])[\t ]+$/
In plain words, this matches all tabs and spaces on lines that follow a character that is not a tab or a space.
Wouldn't this help?
/([^\t ])([\t ]+)$/
You need to do something with the matched last non-space character, though.
edit: oh, you meant non blank lines. Then you would need something like /([^\s])\s+/ and sub them with the first part
I'm not entirely sure what you are asking for, but wouldn't something like this work if you just want to capture the trailing whitespaces?
([\s]+)$
or if you only wanted to capture spaces and tabs
([ \t]+)$
Since regexes are greedy, they'll capture as much as they can. You don't really need to give them context beforehand if you know what you want to capture.
I still am not sure what you mean by trailing indentation placeholders, so I'm sorry if I'm misunderstanding.
perhaps this...
[\t|\s]+?$
or
[ ]+$
