Regex for Git commit message - ruby

I'm trying to come up with a regex for enforcing Git commit messages to match a certain format. I've been banging my head against the keyboard modifying the semi-working version I have, but I just can't get it to work exactly as I want. Here's what I have now:
/^([a-z]{2,4}-[\d]{2,5}[, \n]{1,2})+\n{1}^[\w\n\s\*\-\.\:\'\,]+/i
Here's the text I'm trying to enforce:
AB-1432, ABC-435, ABCD-42
Here is the multiline description, following a blank
line after the Jira issue IDs
- Maybe bullet points, with either dashes
* Or asterisks
Currently, it matches that, but it will also match if there's no blank line after the issue IDs, and if there are multiple blank lines after them.
Is there any way to enforce that, or will I just have to live with it?
It's also pretty ugly, I'm sure there's a more succinct way to write that out.
Thanks.

Your regex allows for \n as one of the possible characters after the required newline, so that's why it matches when there are multiple.
Here's a cleaned up regex:
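/^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w \t*.:',]+\n)+/i
Notes:
This requires at least one [-\w \t*.:',] character before the next newline. The class lists space and tab instead of \s, because \s also matches \n and would let extra blank lines through.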
I changed the issue IDs to have one possible comma, space, and newline, in that order (up to one of each). Can you use lookaheads? If so, I added (?=[, \n]) to make sure the issue ID is followed by at least one of those characters.
Also notice that many of the characters don't need to be escaped in a character class.
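A stricter variant can be sketched in Ruby as follows. This is only a sketch under some assumptions: the names ISSUE_ID, COMMIT_FORMAT and valid_commit_message? are made up for illustration, and the IDs are assumed to sit on the first line, separated by a comma and an optional space. It anchors the whole message with \A and \z and uses a lookahead to require a non-whitespace character right after the single blank line.

```ruby
# Hypothetical names; the ID shape (2-4 letters, a dash, 2-5 digits)
# is taken from the question's own pattern.
ISSUE_ID = /[a-z]{2,4}-\d{2,5}/i

# \A and \z anchor the whole message. After the ID line there must be exactly
# one empty line: "\n\n" followed by a non-whitespace character.
COMMIT_FORMAT = /\A#{ISSUE_ID}(?:, ?#{ISSUE_ID})*\n\n(?=\S)[-\w\s*.:',]+\z/

def valid_commit_message?(message)
  COMMIT_FORMAT.match?(message)
end

ids  = "AB-1432, ABC-435, ABCD-42"
body = "Here is the description\n- a bullet\n* another bullet"

p valid_commit_message?("#{ids}\n\n#{body}")    # => true  (one blank line)
p valid_commit_message?("#{ids}\n#{body}")      # => false (no blank line)
p valid_commit_message?("#{ids}\n\n\n#{body}")  # => false (two blank lines)
```

Unlike the line-anchored pattern, this one also rejects messages that have anything before the ID line, because \A only matches at the very start of the string.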

Related

Multi-Line Regex: Find A where B is absent

I have been looking through a lot on Regex lately and have seen a lot of answers involving the matching of one word, where a second word is absent. I have seen a lot of Regex Examples where I can have a Regex search for a given word (or any more complex regex in its place) and find where a word is missing.
It seems like this works very well on a line-by-line basis, but after including the multi-line mode it still doesn't seem to match properly.
Example: Match an entire file string where the word foo is included, but the word bar is absent from the file. What I have so far is (?m)^(?=.*?(foo))((?!bar).)*$ which is based off the example link. I have been testing with a Ruby Regex tester, but I think it is an open-ended regex problem/question. It seems to match smaller pieces, whereas I would like it to either match or not match on the entire string as one big chunk.
In the provided example above, matches are found on a line by line basis it seems. What changes need to be made to the regex so it applies over the ENTIRE string?
EDIT: I know there are other, more efficient ways to solve this problem that don't involve using a regex. I am not looking for a solution to the problem using other means; I am asking from a theoretical regex point of view. It has a multi-line mode (which looks to "work"), and it has negative/positive lookaheads which can be combined on a line-by-line basis, so how come combining these two principles doesn't yield the expected result?
Sawa's answer can be simplified, all that's needed is a positive lookahead, a negative lookahead, and since you're in multiline mode, .* takes care of the rest:
/(?=.*foo)(?!.*bar).*/m
Multiline means that . matches \n also, and matches are greedy. So the whole string will match without the need for anchors.
Update
@Sawa makes a good point for the \A being necessary but not the \Z.
Actually, looking at it again, the positive lookahead seems unnecessary:
/\A(?!.*bar).*foo.*/m
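To see why the \A matters, here is a small Ruby check on a made-up two-line string (the variable names are only for illustration). Without the anchor, the engine is free to retry one character later, where "bar" is no longer ahead of the current position.

```ruby
unanchored = /(?=.*foo)(?!.*bar).*/m
anchored   = /\A(?!.*bar).*foo.*/m

text = "bar\nfoo"  # contains both words, so it should be rejected

# The attempt at index 0 fails because "bar" is ahead, but the attempt
# at index 1 only sees "ar\nfoo" and succeeds.
p unanchored.match(text)[0]    # => "ar\nfoo"

# With \A only index 0 is tried, so the string is rejected as a whole.
p anchored.match?(text)        # => false
p anchored.match?("foo\nbaz")  # => true
```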
A regex that matches an entire string that does not include foo is:
/\A(?!.*foo.*).*\z/m
and a regex that matches from the beginning of an entire string that includes bar is:
/\A.*bar/m
Since you want to satisfy both of these, take a conjunction of these by putting one of them in a lookahead:
/\A(?=.*bar)(?!.*foo.*).*\z/m

How do I regex a name and an email out of the 3 major email clients in ruby?

I thought I had it figured out, but it appears that my regex still has quirks in it. Basically I would like to use the same regex pattern to match the following major email clients (Gmail, Yahoo, and regular email):
"Brian Mang" <brian.mang@email.com> -- Case1
Brian Mang (brian.mang@email.com) -- Case2
<brian.mang@email.com> -- Case3
brian.mang@email.com -- Case4
I had the following regex pattern:
/[\W"]*(?<name>.*?)[\"]*?\s*[<(](?<email>\w.*)[>)]/.match(contact)
and it works for Cases 1-3, but I can't get it to pick up Case 4. I tried messing around with it but can't figure it out, because every change breaks the other cases. Any idea what I need to change to make my regex pick up all four cases? Thank you.
Try this
[\W"]*(?<name>.*?)[\"]*?\s*[<(]?(?<email>\S+@\S+)[>)]?
See it here on Regexr
I made the classes surrounding the address optional and changed the part that matches the email to \S+@\S+, which means at least one non-whitespace character, followed by an @, then at least one more non-whitespace character.
Since the above version also captures the closing character in the email group, you can restrict the part after the @ a bit more
[\W"]*(?<name>.*?)[\"]*?\s*[<(]?(?<email>\S+@[^\s>)]+)[>)]?
see it here on Regexr
Edit: This one works for all four:
[\W"]*(?<name>.*?)[\"]*?\s*[<(]?(?<email>\S+@[^)>]+)[>)]?
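As a rough check, the last pattern can be wrapped in a small Ruby helper and run against the four input shapes. The constant CONTACT and the method parse_contact are hypothetical names, and the sample strings assume the usual @ separator in the addresses and leave off the trailing "-- CaseN" labels.

```ruby
# The final pattern from the answer, unchanged apart from the @ separator.
CONTACT = /[\W"]*(?<name>.*?)[\"]*?\s*[<(]?(?<email>\S+@[^)>]+)[>)]?/

# Returns a hash with :name and :email, or nil when nothing matches.
def parse_contact(contact)
  m = CONTACT.match(contact)
  m && { name: m[:name], email: m[:email] }
end

samples = [
  '"Brian Mang" <brian.mang@email.com>',
  'Brian Mang (brian.mang@email.com)',
  '<brian.mang@email.com>',
  'brian.mang@email.com'
]

samples.each { |s| p parse_contact(s) }
# The first two yield name "Brian Mang"; the last two yield an empty name.
# All four yield email "brian.mang@email.com".
```

Because [^)>]+ also accepts spaces, any text that follows a bare address on the same line ends up in the email group, so inputs may need trimming first.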

How to conflate consecutive gsubs in ruby

I have the following
address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/,"")
Can this be done in one gsub? I would like to pass a list of patterns rather than just one pattern, since they are all being replaced by the same thing.
You could combine them with an alternation operator (|):
address = '6 66-666 #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
# " 66-666 "
address = 'pancakes 6 66-666 # pancakes #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/,"")
# "pancakes 6 66-666 pancakes "
You might want to add a little more whitespace cleanup. And you might want to switch to one of:
/\A\d*|\d*-?\d*\z|\# ?\d*/
/\A\d*|\d*-?\d*\Z|\# ?\d*/
depending on what your data really looks like and how you need to handle newlines.
Combining the regexes is a good idea--and relatively simple--but I'd like to recommend some additional changes. To wit:
address.gsub(/^\d+|\d+(?:-\d+)?$|\# *\d+/, "")
Of your original regexes, ^\d* and \d*-?\d*$ will always match, because they don't have to consume any characters. So you're guaranteed to perform two replacements on every line, even if that's just replacing empty strings with empty strings. Of my regexes, ^\d+ doesn't bother to match unless there's at least one digit at the beginning of the line, and \d+(?:-\d+)?$ matches what looks like an integer-or-range expression at the end of the line.
Your third regex, \# ?\d*, will match any # character, and if the # is followed by a space and some digits, it'll take those as well. Judging by your other regexes and my experience with other questions, I suspect you meant to match a # only if it's followed by one or more digits, with optional spaces intervening. That's what my third regex does.
If any of my guesses are wrong, please describe what you were trying to do, and I'll do my best to come up with the right regex. But I really don't think those first two regexes, at least, are what you want.
EDIT (in answer to the comment): When working with regexes, you should always be aware of the distinction between a regex that matches nothing and a regex that doesn't match. You say you're applying the regexes to street addresses. If an address doesn't happen to start with a house number, ^\d* will match nothing--that is, it will report a successful match, said match consisting of the empty string preceding the first character in the address.
That doesn't matter to you, since you're just replacing it with another empty string anyway. But why bother doing the replacement at all? If you change the regex to ^\d+, it will report a failed match and no replacement will be performed. The result is the same either way, but the "matches nothing" scenario (^\d*) results in a lot of extra work that the "doesn't match" scenario avoids. In a high-throughput situation, that could be a life-saver.
The other two regexes bring additional complications: \d*-?\d*$ could match a hyphen at the end of the string (e.g. "123-", or even "-"); and \# ?\d* could match a hash symbol anywhere in the string, not just as part of an apartment/office number. You know your data, so you probably know neither of those problems will ever arise; I'm just making sure you're aware of them. My regex \d+(?:-\d+)?$ deals with the trailing-hyphen issue, and \# *\d+ at least makes sure there are digits after the hash symbol.
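Since the question asks about passing a list of patterns, one possible sketch uses Regexp.union from Ruby's core library to build the alternation from an array. The names PATTERNS, NOISE and strip_numbers are invented here, and the whitespace cleanup at the end is an extra assumption about what the result should look like.

```ruby
# The three stricter patterns from this answer, kept as a list.
PATTERNS = [
  /^\d+/,           # number at the start of a line
  /\d+(?:-\d+)?$/,  # number or range at the end of a line
  /\# *\d+/         # "#" followed by optional spaces and digits
].freeze

# Regexp.union joins the patterns with "|" into a single Regexp.
NOISE = Regexp.union(PATTERNS)

def strip_numbers(address)
  # One gsub pass, then collapse leftover runs of spaces and trim the ends.
  address.gsub(NOISE, "").squeeze(" ").strip
end

p strip_numbers("6 66-666 #99 11-23")
# => "66-666"
p strip_numbers("pancakes 6 66-666 # pancakes #99 11-23")
# => "pancakes 6 66-666 # pancakes"
```

Note that the lone "#" in the second input survives, because the third pattern now requires digits after it.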
I think that if you combine them together in a single gsub() regex, as an alternation,
it changes the context of the starting search position.
Example, each of these lines start at the beginning of the result of the previous
regex substitution.
s/^\d*//g
s/\d*-?\d*$//g
s/\# ?\d*//g
and this
s/^\d*|\d*-?\d*$|\# ?\d*//g
resumes search/replace where the last match left off and could potentially produce a different overall output, especially since a lot of the subexpressions search for similar
if not the same characters, distinguished only by line anchors.
I think your regexes are distinct enough in this case, and of course changing the order
changes the result.
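A small Ruby input where the chained and the combined forms of the question's original patterns do differ is "#12" (a made-up value, chosen only to show the effect). In the single pass, the first alternative succeeds with an empty match at the start, and gsub then steps past the "#" before the third alternative gets a chance.

```ruby
address = "#12"

# Three passes: each gsub scans the previous result from the beginning.
sequential = address.gsub(/^\d*/, "")
                    .gsub(/\d*-?\d*$/, "")
                    .gsub(/\# ?\d*/, "")

# One pass: at index 0, ^\d* matches the empty string, so the scan copies
# "#" and resumes at index 1, where only "12" is left to remove.
combined = address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")

p sequential  # => ""
p combined    # => "#"
```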

Inserting characters before whatever is on a line, for many lines

I have been looking at regular expressions to try and do this, but the most I can do is find the start of a line with ^, but not replace it.
I can then find the first characters on a line to replace, but can not do it in such a way with keeping it intact.
Unfortunately I don't have access to a tool like cut since I am on a Windows machine...so is there any way to do what I want with just regexp?
Use Notepad++. It offers a way to record a sequence of actions which then can be repeated for all lines in the file.
Did you try replacing the regular expression ^ with the text you want to put at the start of each line? Also you should use the multiline option (also called m in some regex dialects) if you want ^ to match the start of every line in your input rather than just the first.
string s = "test test\ntest2 test2";
s = Regex.Replace(s, "^", "foo", RegexOptions.Multiline);
Console.WriteLine(s);
Result:
footest test
footest2 test2
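For comparison, the same replacement can be sketched in Ruby, where ^ is a line anchor by default and no multiline option is needed for this. The variable names are arbitrary.

```ruby
text = "test test\ntest2 test2"

# ^ matches at the start of every line, so gsub inserts the prefix there.
prefixed = text.gsub(/^/, "foo")
puts prefixed
# footest test
# footest2 test2

# An equivalent without a regex: prefix each line and join them back up.
also_prefixed = text.each_line.map { |line| "foo#{line}" }.join
```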
I used to program on the mainframe and got used to SPF panels. I was thrilled to find a Windows version of the same editor at Command Technology. Makes problems like this drop-dead simple. You can use expressions to exclude or include lines, then apply transforms on just the excluded or included lines and do so inside of column boundaries. You can even take the contents of one set of lines and overlay the contents of another set of lines entirely or within column boundaries which makes it very easy to generate mass assignments of values to variables and similar tasks. I use Notepad++ for most stuff but keep a copy of SPFSE around for special-purpose editing like this. It's not cheap but once you figure out how to use it, it pays for itself in time saved.

regex to match trailing whitespace, but not lines which are entirely whitespace (indent placeholders)

I've been trying to construct a ruby regex which matches trailing spaces - but not indentation placeholders - so I can gsub them out.
I had this /\b[\t ]+$/ and it was working a treat until I realised it only works when the line ends are [a-zA-Z]. :-( So I evolved it into this /(?!^[\t ]+)[\t ]+$/ and it seems like it's getting better, but it still doesn't work properly. I've spent hours trying to get this to work to no avail. Please help.
Here's some text test so it's easy to throw into Rubular, but the indent lines are getting stripped so it'll need a few spaces and/or tabs. Once lines 3 & 4 have spaces back in, it shouldn't match on lines 3-5, 7, 9.
some test test
some test test
some other test (text)
some other test (text)
likely here{ dfdf }
likely here{ dfdf }
and this ;
and this ;
Alternatively, is there a simpler / more elegant way to do this?
If you're using 1.9, you can use look-behind:
/(?<=\S)[\t ]+$/
but unfortunately, it's not supported in older versions of ruby, so you'll have to handle the captured character:
str.gsub(/(\S)[\t ]+$/) { $1 }
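Both variants can be tried on a short sample in Ruby. The string below is invented for the test: it has trailing spaces, a line of four spaces only, and a tab-indented line with a trailing tab.

```ruby
TRAILING = /(?<=\S)[\t ]+$/  # look-behind form from this answer

text = "foo  \n    \n\tbar\t\nbaz"

# Trailing whitespace goes away, the whitespace-only line stays intact.
cleaned = text.gsub(TRAILING, "")
p cleaned  # => "foo\n    \n\tbar\nbaz"

# Capture-group form for engines without look-behind: put back the
# non-whitespace character that was consumed by the match.
cleaned_with_capture = text.gsub(/(\S)[\t ]+$/) { $1 }
p cleaned_with_capture == cleaned  # => true
```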
Your first expression is close, and you just need to change the \b to a negated character class. This should work better:
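/([^\t ])[\t ]+$/
In plain words, this matches trailing tabs and spaces that follow a character that is not a tab or a space. Note that [^\t ] also matches a newline, so a whitespace-only line that follows another line is still caught; using \S for the preceding character avoids that.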
Wouldn't this help?
/([^\t ])([\t ]+)$/
You need to do something with the matched last non-space character, though.
edit: oh, you meant non blank lines. Then you would need something like /([^\s])\s+/ and sub them with the first part
I'm not entirely sure what you are asking for, but wouldn't something like this work if you just want to capture the trailing whitespaces?
([\s]+)$
or if you only wanted to capture tabs
([ \t]+)$
Since regexes are greedy, they'll capture as much as they can. You don't really need to give them context beforehand if you know what you want to capture.
I still am not sure what you mean by trailing indentation placeholders, so I'm sorry if I'm misunderstanding.
perhaps this...
[\t|\s]+?$
or
[ ]+$
