Regex for finding at most n consecutive patterns - ruby

Let's say our pattern is a regex for capital letters (but we could have a more complex pattern than searching for capitals).
To find at least n consecutive patterns (in this case, the pattern we are looking for is simply a capital letter), we can do this:
(Using Ruby)
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ"]
at_least_3_capitals = somestring.scan(/[A-Z]{3}[A-Z]*/)
=> ["ABC", "XYZ"]
However, how do I search for at most n consecutive patterns, for example, at most one consecutive capital letter:
matches = somestring.scan(/ ??? /)
=> [" deFgHij kLmN pQrS ", " abcdEf"]
Detailed strategy
I read that I need to negate the "at least" regex by turning it into a DFA, negating the accept states (then converting it back to an NFA, though we can leave it as it is), and writing the result as a regex. If we think of encountering our pattern as receiving a '1' and not encountering it as receiving a '0', we can draw a simple DFA diagram (where n=1, we want at most one of our pattern).
Specifically, I was wondering how this DFA becomes a regex. More generally, I hope to learn how to express "at most" with a regex, as my regex skills feel stunted with "at least" alone.
Trip Hazards - not quite the right solution in spirit
Note that this question is not a duplicate of this post, as using the accepted methodology there would give:
somestring.scan(/[A-Z]{2}[A-Z]*(.*)[A-Z]{2}[A-Z]*/)
=> [[" deFgHij kLmN pQrS X"]]
Which is not what the DFA shows, not just because it misses the second sought match - more importantly that it includes the 'X', which it should not, as 'X' is followed by another capital, and from the DFA we see that a capital which is followed by another capital is not an accept state.
You could suggest
somestring.split(/[A-Z]{2}[A-Z]*/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]
(Thanks to Rubber Duck)
but I still want to know how to find at most n occurrences using regex alone. (For knowledge!)

Why your attempt does not work
There are a few problems with your current attempt.
The reason that X is part of the match is that .* is greedy and consumes as much as possible - hence, leaving only the required two capital letters to be matched by the trailing bit. This could be fixed with a non-greedy quantifier.
The reason why you don't get the second match is twofold. First, you require two trailing capital letters to be there, but instead there is the end of the string. Second, matches cannot overlap. The first match includes at least two trailing capital letters, but the second would need to match these again at the start which is not possible.
There are more hidden problems: try an input with four consecutive capital letters - it can give you an empty match (provided you use the non-greedy quantifier - the greedy one has even worse problems).
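The points above can be checked quickly in irb. This is only a sketch: the `lazy` variant and the "ABCD efg" input are my own illustrations, not something from the question.

```ruby
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"

# Greedy .* runs to the end and backtracks only as far as needed,
# so it keeps the "X" and leaves just "YZ" for the trailing part.
greedy = somestring.scan(/[A-Z]{2}[A-Z]*(.*)[A-Z]{2}[A-Z]*/)
# => [[" deFgHij kLmN pQrS X"]]

# Non-greedy .*? stops as soon as the trailing capitals can match.
lazy = somestring.scan(/[A-Z]{2}[A-Z]*(.*?)[A-Z]{2}[A-Z]*/)
# => [[" deFgHij kLmN pQrS "]]

# Four capitals in a row (hypothetical input): "AB" and "CD" act as
# the two delimiters, so the capture between them is empty.
empty = "ABCD efg".scan(/[A-Z]{2}[A-Z]*(.*?)[A-Z]{2}[A-Z]*/)
# => [[""]]
```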
Fixing all of these with the current approach is hard (I tried and failed - check the edit history of this post if you want to see my attempt until I decided to scrap this approach altogether). So let's try something else!
Looking for another solution
What is it that we want to match? Disregarding the edge cases, where the match starts at the beginning of the string or ends at the end of the string, we want to match:
(non-caps) 1 cap (non-caps) 1 cap (non-caps) ....
This is ideal for Jeffrey Friedl's unrolling-the-loop technique, which looks like:
[^A-Z]+(?:[A-Z][^A-Z]+)*
Now what about the edge cases? We can phrase them like this:
We want to allow a single capital letter at the beginning of the match, only if it's at the beginning of the string.
We want to allow a single capital letter at the end of the match, only if it's at the end of the string.
To add these to our pattern, we simply group a capital letter with the appropriate anchor and mark both together as optional:
(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?
Now it's really working. And even better, we don't need capturing any more!
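As a rough check, here are both patterns side by side. The constant names and the "Abc dE" input are made up for illustration. Note that in Ruby `^` and `$` match at line boundaries; for a multi-line string you would probably want `\A` and `\z` instead.

```ruby
# Unrolled loop without the edge cases.
BASIC    = /[^A-Z]+(?:[A-Z][^A-Z]+)*/
# Same pattern with the optional anchored capitals added.
ANCHORED = /(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?/

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
somestring.scan(ANCHORED)  # => [" deFgHij kLmN pQrS ", " abcdEf"]

# Hypothetical input with a single capital at each end of the string.
edge = "Abc dE"
edge.scan(BASIC)     # => ["bc d"]    (both capitals are lost)
edge.scan(ANCHORED)  # => ["Abc dE"]  (both capitals are kept)
```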
Generalizing the solution
This solution is easily generalized to the case of "at most n consecutive capital letters", by changing each [A-Z] to [A-Z]{1,n} and thereby allowing up to n capital letters where there is only one allowed so far.
See the demo for n = 2.
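If you need this for several values of n, the pattern can be built with interpolation. A minimal sketch, assuming the pattern is still "capital letter"; `at_most_caps` and the sample input are hypothetical names of mine.

```ruby
# Hypothetical helper: builds the regex for "at most n consecutive
# capital letters" by swapping each [A-Z] for [A-Z]{1,n}.
def at_most_caps(n)
  caps = "[A-Z]{1,#{n}}"
  /(?:^#{caps})?[^A-Z]+(?:#{caps}[^A-Z]+)*(?:#{caps}$)?/
end

sample = "aBCd EFGh"          # made-up input
sample.scan(at_most_caps(1))  # => ["a", "d ", "h"]
sample.scan(at_most_caps(2))  # => ["aBCd ", "h"]
```

With n = 2 the run "BC" is allowed inside a match, while "EFG" still splits the result.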

tl;dr
To match words containing at most N consecutive PATTERNs, use the regex
/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
For example, to match words containing at most 1 consecutive capital letter,
/\b(?:\w(?:(?<![A-Z])|(?!(?:[A-Z]){1,})))+\b/
This works for multi-character patterns too.
Clarification Needed
I'm afraid your examples may cause confusion. Let's add a few words:
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Now, rerunning your at-least-2-capitals regex returns
at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ", "DC", "DC", "TT"]
Note how complete words are not captured! Are you sure this is what you wanted? I ask, of course, because in your latter examples, your at-most-1-capital regex returns complete words, instead of just the capital letters being captured.
Solution
Here's the solution either way.
First, for matching just patterns (and not entire words, as consistent with your initial examples), here's a regex for at-most-N-PATTERNs:
/(?<!PATTERN)(?!(?:PATTERN){N+1,})(?:PATTERN)+/
For example, the at-most-1-capitals regex would be
/(?<![A-Z])(?!(?:[A-Z]){2,})(?:[A-Z])+/
and returns
=> ["F", "H", "L", "N", "Q", "S", "E", "C", "I", "C", "I", "T"]
To further exemplify, the at-most-2-capitals regex
/(?<![A-Z])(?!(?:[A-Z]){3,})(?:[A-Z])+/
returns
=> ["F", "H", "L", "N", "Q", "S", "E", "C", "DC", "I", "C", "I", "DC", "T", "TT"]
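To try other values of N, the general form can be generated with interpolation. This is just a sketch: `runs_of_at_most` is a hypothetical helper name, and it assumes PATTERN is given as a string that is valid inside a lookbehind.

```ruby
# Hypothetical helper: matches runs of PATTERN that are at most n long,
# i.e. not preceded by PATTERN and not starting n+1 or more of them.
def runs_of_at_most(n, pattern = "[A-Z]")
  /(?<!#{pattern})(?!(?:#{pattern}){#{n + 1},})(?:#{pattern})+/
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf " \
             "mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"

somestring.scan(runs_of_at_most(1))
# => ["F", "H", "L", "N", "Q", "S", "E", "C", "I", "C", "I", "T"]
somestring.scan(runs_of_at_most(2))
# => ["F", "H", "L", "N", "Q", "S", "E", "C", "DC", "I", "C", "I", "DC", "T", "TT"]
```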
Finally, if you wanted to match entire words that contained at most a certain number of consecutive patterns, then here's a whole different approach:
/\b(?:\w(?:(?<![A-Z])|(?![A-Z]{1,})))+\b/
This returns
["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]
The general form is
/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
You can see all these examples at http://ideone.com/hImmZr.
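The whole-word form can be parameterized the same way. Again only a sketch: `words_with_at_most` is a name I made up, and the pattern is assumed to be a string usable inside a lookbehind.

```ruby
# Hypothetical helper: whole words in which no run of PATTERN is longer
# than n. Each word character must either not be a PATTERN, or not be
# followed by n more PATTERNs.
def words_with_at_most(n, pattern = "[A-Z]")
  /\b(?:\w(?:(?<!#{pattern})|(?!(?:#{pattern}){#{n},})))+\b/
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf " \
             "mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"

somestring.scan(words_with_at_most(1))
# => ["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]
somestring.scan(words_with_at_most(2))
# => ["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mixeDCaps",
#     "mIxedCaps", "mIxeDCaps", "T", "TT", "t", "tt"]
```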

To find "at most" with a regex, you use the quantifier {1,n} (possibly preceded by a negative lookbehind and followed by a negative lookahead), so it seems that what you want is:
irb(main):006:0> somestring.scan(/[A-Z]{1,2}/)
=> ["AB", "C", "F", "H", "L", "N", "Q", "S", "XY", "Z", "E"]
or
irb(main):007:0> somestring.scan(/(?<![A-Z])[A-Z]{1,2}(?![A-Z])/)
=> ["F", "H", "L", "N", "Q", "S", "E"]
EDIT: if the OP still wants "the longest strings not including more than two uppercase letters", they can use:
irb(main):025:0> somestring.scan(/[^A-Z]+(?:[A-Z]{1,2}[^A-Z]+)*/)
=> [" deFgHij kLmN pQrS ", " abcdEf"]
(but that regex won't include capitals at the very beginning or the very end of the string)
It seems that
irb(main):026:0> somestring.split(/[A-Z]{3,}/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]
would be better for that.

Related

How to write a Ruby program using only "a-z", "A-Z", ".", "\n" and " "

I am trying to solve a Ruby challenge and I have a big problem with some low-level (I suppose?) digit and character conversion.
Just imagine that you need to write a program that prints a sentence on your screen, e.g.:
"Jon Doe was born in 2017!"
using only the following characters: [a-zA-Z.\n ] (lowercase and uppercase letters, dot, space and newline).
In fact, I have no idea where I should even start looking for the answer.
Is it some use of the pack / unpack methods? Or is there a trivial solution that I can't find?
There is a reason why "write program" was bolded. The question is: what is the simplest definition of a program in Ruby?
Writing a program that prints a string using only letters, dot, space and newline might seem impossible at first, but it is actually not that hard.
Lowercase and uppercase letters allow you to invoke Kernel methods (like print and puts) as well as keywords like nil, false and true. A dot allows you to invoke methods with an explicit receiver. Space allows you to pass an argument to a method. Newline separates commands.
Let's try to get an "a":
false #=> false
false.inspect #=> "false"
false.inspect.chars #=> ["f", "a", "l", "s", "e"]
false.inspect.chars.rotate #=> ["a", "l", "s", "e", "f"]
false.inspect.chars.rotate.first #=> "a"
Now let's print "abc":
print false.inspect.chars.rotate.first
print false.inspect.chars.rotate.first.succ
print false.inspect.chars.rotate.first.succ.succ
puts
Output:
abc
You get the idea.
And yes, it's also possible to print spaces, punctuation and numbers using a similar approach. But I leave that to you. Take a look at the available methods and be creative.
Additional points for figuring out how to print a string without using space, just [a-zA-Z.\n].
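As one possible starting point for digits, here is a sketch that prints 2017. The method names (two, zero, one, seven) are my own, and the block deliberately has no comments, since "#" is outside the allowed character set. It relies on nil.inspect being "nil" (size 3) and false.inspect being "false" (size 5), then steps with pred and succ.

```ruby
def two
  nil.inspect.size.pred
end
def zero
  two.pred.pred
end
def one
  zero.succ
end
def seven
  false.inspect.size.succ.succ
end
print two
print zero
print one
print seven
puts
```

Since def and end consist of letters only, helper methods are available under the same restriction.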

How does this regexp split on the first vowel?

This code splits a word into two strings at the first vowel. Why?
word = "banana"
parts = word.split(/([aeiou].*)/)
The key here is the regular expression (or regex) that is being used between the two /'s
[aeiou] says to look for the first instance of one of those characters.
. matches any single character
* modifies the previous thing to mean match 0 or more of it
(...) means capture everything enclosed between the parentheses
Translated to English, this regular expression might read something like "Given a string, find the first vowel that is followed by zero or more characters. Collect that vowel and its following characters and set them aside."
The slightly more confusing part is the regex's interaction with the split method. The text the regex matches is 'anana'. And we can see that calling split with the string 'anana' doesn't have the same result:
'banana'.split('anana') #=> ["b"]
But when split is called with a regular expression that uses a capture group - or parentheses (...), then anything in that capture group will also be returned in the result of the split. Which is why:
'banana'.split /([aeiou].*)/ #=> ["b", "anana"]
If you want to learn more about how regular expressions work (particularly in ruby), Rubular is a great resource to fiddle with - http://www.rubular.com/r/XEUgPhOdlH
This is actually a bit tricky. This regexp
/[aeiou].*/
matches the string from the first vowel to the end of the string i.e. "anana". But if you were to split on that, you would only get the first letter since split doesn't include the splitting pattern:
"banana".split /[aeiou].*/
# ["b"]
But according to the String#split docs, if the splitting pattern is a regexp with a capture group, the capture groups are included in the result as well. Since the whole pattern is wrapped in a capture group, the result is that the string splits before the first vowel.
For example, if you change the regexp to have two capture groups, it splits further:
"banana".split /([aeiou])(.*)/
# ["b", "a", "nana"]
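One way to see where the pieces come from is to look at the match data directly. This is an illustration of the behaviour described above, not how split is implemented internally; the variable names are mine.

```ruby
word = "banana"
m = word.match(/([aeiou].*)/)

# The text before the match, the capture group, and the text after it.
pieces = [m.pre_match, m[1], m.post_match]  # => ["b", "anana", ""]

# With a negative limit, split keeps the trailing empty string.
kept = word.split(/([aeiou].*)/, -1)        # => ["b", "anana", ""]

# Without a limit, trailing empty strings are dropped.
dropped = word.split(/([aeiou].*)/)         # => ["b", "anana"]
```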
ANSWER FOR OLD TITLE
It's not really Ruby syntax; it's standard regular expression syntax, which Ruby also implements.
* means zero or more of the previous item
. means any character
[aeiou] means any one of the characters inside the brackets
() means capture it
So that regex means: capture anything that starts with a, e, i, o, or u.
word.split(/([aeiou].*)/) means: split the word variable on anything that starts with the letter a, e, i, o, or u.
See here for more information.
ANSWER FOR NEW TITLE
Why does it split on the first vowel? It's not really like that. What it does is split on anything that starts with a vowel, and also capture it (the string that starts with a vowel). See more examples here:
word = 'banana'
word.split /[aeiou]/ # split on vowels
#=> ["b", "n", "n"]
word.split /([aeiou])/ # split on vowels and capture the vowels
#=> ["b", "a", "n", "a", "n", "a"]
word.split /[aeiou].*/ # split on anything that starts with a vowel
#=> ["b"]
word.split /([aeiou].*)/ # split on anything that starts with a vowel and also capture it
#=> ["b", "anana"]
ANSWER FOR OLD TITLE
If the * symbol is not inside a regular expression // (Ruby's syntax), there are some possibilities:
multiplication: 2 * 3 == 6, 'na' * 3 == 'nanana' # batman!
splat operation: [*(1..4)] == [1,2,3,4], see more info here

How can I match Word Boundary "or" [@#]?

I can't seem to get a regex that matches either a hashtag #, an @, or a word boundary. The goal is to break a string into Twitter-like entities and topics, so:
input = "Hello #world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "#world", "#ruby", "anotherString"]
To get just the words, excluding "anotherString" which is too large, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and @s. It seems like it should work simply with:
/[\b@#]\w{3,12}\b/
but that returns ["#world", "#ruby"]. This made me realize that word boundaries are not by definition a character, so they don't fall into the category of "A single character" and, so, won't match. A few more attempts:
/\b|[@#]\w{3,12}\b/
returns ["", "", "#world", "", "#ruby", "", "", ""].
/((\b|[@#])\w{3,12}\b)/
matches the right things, but returns [[""], ["#"], ["#"], [""]] as expected, because the parentheses also mean capture everything enclosed.
/((\b|[@#])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["#world", "#"], ["#ruby", "#"]]. So now all the correct items are there, they're just located at the first element of each of the subarrays. The following snippet technically works:
input.scan(/((\b|[@#])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this to match and return the correct substrings with just the regular expression not requiring the collect post-processing?
You can just use the regular expression /[@#]?\b\w+\b/. That is, optionally match a # or @, followed by a word boundary (in #ruby, that boundary would be between # and ruby, in a normal word it would also match at the start of the word) and a bunch of word characters.
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w+\b/)
# => ["Hello", "#world", "#ruby", "anotherString"]
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. You gave an example in a comment to a deleted answer to match only #ruby by using {3,4}:
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w{3,4}\b/)
# => ["#ruby"]
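Putting the optional prefix together with the length limit from the question gives something like the following. This is a sketch that assumes the character class is meant to cover both @ and #; the second input string is made up.

```ruby
input = "Hello #world, #ruby anotherString"

# Optional sigil, a word boundary, then a word of 3 to 12 characters.
short = input.scan(/[@#]?\b\w{3,12}\b/)
# => ["Hello", "#world", "#ruby"]

# Hypothetical input that also contains an @ entity.
mentions = "Hi @matz and #ruby".scan(/[@#]?\b\w+\b/)
# => ["Hi", "@matz", "and", "#ruby"]
```

No capture groups are involved, so scan returns a flat array and no collect step is needed.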

How to use String#scan using regular expressions by digit length and keep odd remainder as element

I want to separate a string in units of three, but also have the remainder as a separate element.
def separate(string)
  string.scan(/\w{3}/)
end
So, if I pass in "BENISME" I want it to return [BEN][ISM] and then also [E].
I know there is an easy answer to this, but for the life of me I just can't figure it out! What would I add to this to return the remaining E?
def separate(string)
  string.scan(/\w{1,3}/) # => ["BEN", "ISM", "E"]
end
Basically, {n,m} in regexen means "any number of times between n and m."
This will always take the maximum number of characters it can (it will always give you three characters if possible), because quantifiers are "greedy" and always try to take as much as possible, unless you use the non-greedy modifier, like so:
string.scan(/\w{1,3}?/) # => ["B", "E", "N", "I", "S", "M", "E"]
Why use regular expressions at all? You could do each_slice on the character array:
def separate(string)
  string.chars.each_slice(3).map(&:join)
end
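The two approaches only agree when the string consists of word characters. A small sketch with a made-up input containing a space shows the difference; the variable names are mine.

```ruby
text = "BEN ISME"  # hypothetical input with a non-word character

# \w skips the space, so grouping restarts after it.
by_word_chars = text.scan(/\w{1,3}/)                  # => ["BEN", "ISM", "E"]

# . with the /m flag matches any character, including newlines.
by_any_char = text.scan(/.{1,3}/m)                    # => ["BEN", " IS", "ME"]

# each_slice groups by position, whatever the characters are.
by_slices = text.chars.each_slice(3).map(&:join)      # => ["BEN", " IS", "ME"]
```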

How do the Regexp anchors \B and \b differ from each other?

I just learned a bit about \B and \b and tried some code (taken from the internet), but I couldn't understand how the output is produced by these regexp anchors. Could anyone help me understand the difference between \B and \b by explaining how each one behaves during pattern matching in Ruby?
Interactive ruby ready.
> str = "Hit him on the head\n" +
"Hit him on the head with a 2×4\n"
=> "Hit him on the head
Hit him on the head with a 2??4
"
> str.scan(/\w+\B/)
=> ["Hi", "hi", "o", "th", "hea", "Hi", "hi", "o", "th", "hea", "wit"]
> str.scan(/\w+\b/)
=> ["Hit", "him", "on", "the", "head", "Hit", "him", "on", "the", "head", "with", "a", "2", "4"]
>
Thanks,
Like most lower/upper case pairs, they are exact opposites:
\b matches a word boundary – that is, it matches between two characters (since it’s a zero-width match, i.e. it doesn’t consume a character when matching) where one belongs to a word and the other doesn’t. In the text “this person”, \b would match the following positions (denoted by a vertical bar): “|this| |person|”.
\B matches anywhere but at a word boundary. It would match at these positions: “t|h|i|s p|e|r|s|o|n” – that is, between all letters, but not between a letter and a non-letter character.
So if you have \w+\b and match “this person” then you get “this” as the first result, because + is greedy and matches as many word characters (\w) as possible, up to the next word boundary.
\w+\B operates similarly, but it cannot match “this” since that is followed by a word boundary, which \B forbids. So the engine backtracks one character and matches “thi” instead.
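You can make the positions visible by substituting a bar at every zero-width match. A short sketch using the same example text (the variable name is mine):

```ruby
text = "this person"

# Mark every position where each anchor matches.
text.gsub(/\b/, "|")  # => "|this| |person|"
text.gsub(/\B/, "|")  # => "t|h|i|s p|e|r|s|o|n"

# Greedy \w+ followed by each anchor.
text.scan(/\w+\b/)    # => ["this", "person"]
text.scan(/\w+\B/)    # => ["thi", "perso"]
```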

Resources