How do the Regexp anchors \B and \b differ from each other? - ruby

I have only a rough idea of what \B and \b do. I tried some code (taken from the internet) but couldn't understand how those regexp anchors produced the output. Could anyone help me understand the difference between \B and \b by explaining how each one behaves internally during pattern matching in Ruby?
Interactive ruby ready.
> str = "Hit him on the head\n" +
"Hit him on the head with a 2×4\n"
=> "Hit him on the head
Hit him on the head with a 2??4
"
> str.scan(/\w+\B/)
=> ["Hi", "hi", "o", "th", "hea", "Hi", "hi", "o", "th", "hea", "wit"]
> str.scan(/\w+\b/)
=> ["Hit", "him", "on", "the", "head", "Hit", "him", "on", "the", "head", "with", "a", "2", "4"]
>
Thanks,

Like most lower/upper case pairs, they are exact opposites:
\b matches a word boundary, that is, it matches between two characters (it's a zero-width match, i.e. it doesn't consume a character when matching) where one belongs to a word and the other doesn't. It also matches at the start or end of the string when the adjacent character is a word character. In the text “this person”, \b would match the following positions (denoted by a vertical bar): “|this| |person|”.
\B matches anywhere but at a word boundary. It would match at these positions: “t|h|i|s p|e|r|s|o|n” – that is, between all letters, but not between a letter and a non-letter character.
So if you have \w+\b and match “this person” then you get as a result “this” because + is greedy and matches as many word characters (\w) as possible, up to the next word boundary.
\w+\B operates similarly, but it cannot match “this” since that is followed by a word boundary, which \B forbids. So the engine backtracks one character and matches “thi” instead.
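To make the positions concrete, here is a small sketch that lists where each anchor matches in the sample text and then repeats the two \w+ matches described above. The helper name match_positions and the variable names are made up for illustration.

```ruby
text = "this person"

# Illustrative helper: collect the offsets at which a zero-width pattern matches.
def match_positions(text, pattern)
  positions = []
  text.scan(pattern) { positions << Regexp.last_match.begin(0) }
  positions
end

boundary_positions     = match_positions(text, /\b/)  # word boundaries
non_boundary_positions = match_positions(text, /\B/)  # every other position

p boundary_positions      # => [0, 4, 5, 11]
p non_boundary_positions  # => [1, 2, 3, 6, 7, 8, 9, 10]

greedy_to_boundary = text[/\w+\b/]  # => "this"
backtracked        = text[/\w+\B/]  # => "thi" (gives back the final "s")
```

Every offset from 0 to 11 lands in exactly one of the two lists, which is what "exact opposites" means here.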

Related

Ruby regular expressions for finding words

I'm quite new to regular expressions. I am using the regular expression:
/\w+/
To check for words, and it's obvious that this will have problems with punctuation, but I'm not quite sure how to change this regular expression. For example, when I run this command from a class I made:
Wordify.new.regex(/\w+/).string("This sentence isn't 'the best-example, isn't it not?...").display
I get the output:
-----------
this: 1
sentence: 1
isn: 2
t: 2
the: 1
best: 1
example: 1
it: 1
not: 1
-----------
How can I adjust the regular expression so that it matches words with apostrophes, like isn't, as one word, but matches only the when it encounters 'the or the'? Hyphens in the middle of a word like stack-overflow should return stack and overflow separately, which this already does.
Additionally, words shouldn't be able to start or end with numbers, like test1241 or 436test should become test, but te7st is okay. Plain numbers should not be recognised.
Sorry, I know this is a big ask, but I'm not sure where to start with regex. Would be grateful if you could also explain what the expression means if possible.
str = "This is 2a' 4test' of my agréable re4'gex, n'est-ce pas?"
r = /
  [[:alpha:]]                # match a letter
  (?:                        # begin the outer non-capture group
    (?:[[:alpha:]]|\d|')     # match a letter, digit or apostrophe in a non-capture group
    *                        # execute the above non-capture group zero or more times
    [[:alpha:]]              # match a letter
  )?                         # close the outer non-capture group and make it optional
/x                           # free-spacing regex definition mode
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est", "ce", "pas"]
Note that the outer non-capture group is made optional so that a single-letter word can still match.
Hmmm. Maybe we should add a hyphen to the inner non-capture group.
r = /[[:alpha:]](?:(?:[[:alpha:]]|\d|'|-)*[[:alpha:]])?/
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est-ce", "pas"]
I now rarely use the word-matching character \w, mainly because it matches the underscore, as well as letters and digits. Instead I reach for a POSIX bracket expression (search "POSIX"), which has the added (perhaps primary) benefit that it is not English-centric. For example, matching a word character with the exception of an underscore is [[:alnum:]].
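As a rough illustration of the point about \w versus POSIX bracket expressions, the sketch below compares them on two made-up strings. The accented word is written with a \u escape only to keep the snippet encoding-proof.

```ruby
identifier = "foo_bar baz42"
french     = "agr\u00E9able"  # "agréable", written with an escape

with_w      = identifier.scan(/\w+/)           # underscore counts as a word character
with_alnum  = identifier.scan(/[[:alnum:]]+/)  # underscore splits the token
ascii_only  = french.scan(/[a-z]+/)            # the accented letter breaks the run
posix_alpha = french.scan(/[[:alpha:]]+/)      # the accented letter is a letter

p with_w       # => ["foo_bar", "baz42"]
p with_alnum   # => ["foo", "bar", "baz42"]
p ascii_only   # => ["agr", "able"]
p posix_alpha  # => ["agréable"]
```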
You can do something basic using:
/[a-z]+(?:'[a-z]+)*/i
To extend it to allow words like a2b while avoiding 123abc, abc123, and plain numbers:
/[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i
No special regex features are used in the two patterns, only basics.
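For what it's worth, here is a quick check of the extended pattern against the sentence from the question plus a few digit cases. The constant name WORD and the second test string are only for this sketch.

```ruby
# The extended pattern from above, stored under an illustrative name.
WORD = /[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

sentence = "This sentence isn't 'the best-example, isn't it not?..."
sentence_words = sentence.scan(WORD)
p sentence_words
# => ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not"]

# Digits inside a word are kept; leading/trailing digits and plain numbers are dropped.
digit_cases = "a2b 123abc abc123 555".scan(WORD)
p digit_cases
# => ["a2b", "abc", "abc"]
```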
Try scanning the string using the [[:alpha:]] POSIX character class:
s = "This a sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
s.scan(/[[:alpha:]](?:['\w]*[[:alpha:]])?/)
# => ["This", "a", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
[First attempt]
I split the string into tokens separated by whitespace or hyphens then clean up each token per your rules, since it seems like they might be adjusted as you refine your problem:
def tokenize(str)
  tokens = str.split(/(?:\s+|-)/)
  tokens.reduce([]) do |memo, token|
    token.gsub!(/(^\W+|\W+$)/, '')      # Strip enclosing non-words
    token.gsub!(/(^\d+|\d+$)/, '')      # Strip enclosing digits
    memo + (token=='' ? [] : [token])   # Ignore the empty string
  end
end
s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
puts tokenize(s).inspect
# ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
Clearly this solution doesn't use just regular expressions, but for my money it's much easier to understand and modify than (what I imagine) a big regex would look like!

Split string on capital letters, but not if preceded by whitespace

I have a string that looks like
"AaaBbbCcc DddEee"
I'm splitting it with
my_string.scan(/[A-Z][a-z]+/)
and the result is
["Aaa", "Bbb", "Ccc", "Ddd", "Eee"]
What I'd like to achieve is to not split the string if the capital letter is preceded by a white space, so the result would look like
["Aaa", "Bbb", "Ccc Ddd", "Eee"]
my_string.split(/(?<!\s)(?=[A-Z])/)
This matches positions that are not preceded by a whitespace (negative lookbehind - (?<!\s)) and are followed by a capital letter (positive lookahead - (?=[A-Z])).
If you do not need to split or if the number of spaces in between the desired matches can be different, you may use your own approach and match additionally zero or more sequences of whitespace(s) + [A-Z][a-z]+ by adding (?:\s+[A-Z][a-z]+)* subpattern:
my_string.scan(/[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/)
To shorten it a bit, you may build the regex dynamically:
my_string = 'AaaBbbCcc DddEee'
block = "[A-Z][a-z]+"
puts my_string.scan(/#{block}(?:\s+#{block})*/)
And here is a Unicode-friendly version of the above regex:
my_string.scan(/\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*/)
where \p{Lu} matches any uppercase letter and \p{Ll} matches any lowercase letter.
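As a sanity check, the sketch below runs the split and scan variants side by side on the sample string, then tries the ASCII and Unicode patterns on an accented string. The accented string is invented for the example and written with \u escapes.

```ruby
my_string = "AaaBbbCcc DddEee"

by_split   = my_string.split(/(?<!\s)(?=[A-Z])/)
by_scan    = my_string.scan(/[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/)
by_unicode = my_string.scan(/\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*/)
p by_split  # => ["Aaa", "Bbb", "Ccc Ddd", "Eee"] (same for the other two)

# Made-up accented input: "ÉcoleÀla Mode"
accented       = "\u00C9cole\u00C0la Mode"
ascii_result   = accented.scan(/[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/)
unicode_result = accented.scan(/\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*/)
p ascii_result    # => ["Mode"]
p unicode_result  # => ["École", "Àla Mode"]
```

The ASCII ranges silently skip words that start with an accented capital, which is the case the \p{Lu} / \p{Ll} version is meant to cover.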

Ruby Regular expression not matching properly

I am trying to create a regex to find words that contain any vowel.
So far I have tried this:
/(.*?\S[aeiou].*?[\s|\.])/i
but I have not used regexes much, so it's not working properly.
For example, if I input "test is 1234 and sky fly test1234"
it should match test, is, and, test1234, but it shows
test, is, 1234 and
If I input something else, I get different output.
Alternatively you can also do something like:
"test is 1234 and sky fly test1234".split.find_all { |a| a =~ /[aeiou]/ }
# => ["test", "is", "and", "test1234"]
You could use the below regex.
\S*[aeiou]\S*
\S* matches zero or more non-space characters.
or
\w*[aeiou]\w*
This will solve it:
\b\w*[aeiou]+\w*\b
https://www.debuggex.com/r/O-fU394iC5ErcSs7
or you can substitute \w by \S
\b\S*[aeiou]+\S*\b
https://www.debuggex.com/r/RNE6Y6q1q5yPJbe-
\b - a word boundary
\w - same as [_a-zA-Z0-9]
\S - a non-whitespace character
Try this:
\b\w*[aeiou]\w*\b
\b denotes a word boundary, so this regexp matches a word boundary, zero or more word characters, a vowel, zero or more word characters, and another word boundary.
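A short sketch comparing the suggestions above on the input from the question; the variable names are illustrative. On this particular input all three approaches agree.

```ruby
input = "test is 1234 and sky fly test1234"

with_word_chars = input.scan(/\b\w*[aeiou]\w*\b/)
with_non_space  = input.scan(/\S*[aeiou]\S*/)
with_split      = input.split.select { |word| word =~ /[aeiou]/ }

p with_word_chars  # => ["test", "is", "and", "test1234"]
p with_non_space   # => ["test", "is", "and", "test1234"]
p with_split       # => ["test", "is", "and", "test1234"]
```

The \w and \S versions would diverge on input with punctuation attached to words, since \S also consumes punctuation.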

Regex find 'a' or 'an' in sentence in Ruby

I am a beginner with regexes. I thought I could complete this without help, but couldn't.
I want to find article-word pairs in the following sentences (where the article must be A or An):
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
I used this regex pattern:
/[(An)|(an)|a|A]\s+\w+[\s|.]/
Captured pairs are:
'a sentence.', 'n egg ', 'a word.', 'A gee ', 'a word.', 'n is '.
The above pattern couldn't capture An egg fully. More strangely, it captured 'n is ' in Ocean is.
What would be the correct pattern to extract these?
Add a word boundary:
/\b(an?)\s+\w+/i
Edit: (n mustn't be capital)
/\b([aA]n?)\s+\w+/
s = "This is a sentence. An egg is a word. A gee another word.\nLast line is a word. Ocean is very big."
s.scan /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
# => [
# [0] "a sentence",
# [1] "An egg",
# [2] "a word",
# [3] "A gee",
# [4] "a word"
# ]
Here we go: /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
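First comes a lookbehind that prevents matching “an is” in “Ocean is.” Then we look for an a (possibly capital), optionally followed by “n”, then whitespace and the word itself. The trailing m is Ruby's multiline flag (it lets . match newlines), which this pattern doesn't strictly need.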
To avoid using lookbehind, one may change the regexp to:
/\b[Aa]n?\s+[A-Za-z]+/m
UPD: One should avoid using \w here, since \w matches [A-Za-z0-9_]; note especially the underscore.
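To see what the boundary check buys you, this sketch runs the same pattern with and without \b on the text from the question (variable names are made up):

```ruby
text = "This is a sentence. An egg is a word. A gee another word.\n" \
       "Last line is a word. Ocean is very big."

# Without a boundary, the tail of "Ocean" is picked up as an article.
without_boundary = text.scan(/[Aa]n?\s+[A-Za-z]+/)
p without_boundary
# => ["a sentence", "An egg", "a word", "A gee", "a word", "an is"]

# With \b, the article must start at a word boundary.
with_boundary = text.scan(/\b[Aa]n?\s+[A-Za-z]+/)
p with_boundary
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
```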
Try simplifying to \b(An|an|a|A) \w+\b.
I'd use a very simple pattern, along with scan to find all occurrences:
sentence = <<EOT
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
EOT
sentence.scan(/\b an? \s+ [a-z]+/imx)
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
I'm using the x flag to improve the readability of the pattern.
The pattern breaks down to:
\b: a word-boundary so only "a" or "an" match. (It's case insensitive.)
an?: matches "a" or "an".
\s+: matches one or more white-spaces.
[a-z]+: matches consecutive runs of letters only. This is significant because any pattern using the \w character-class would also match 0..9 and "_" (underscore). Your sample doesn't contain those, but any text containing those characters would be likely to give you bad results.
The i flag means ignore case. The m flag lets . match newlines as well, so the text is treated as a single line; normally line ends are significant. The x flag means that whitespace in the pattern is not significant, so \s is required to mark where whitespace must match.
If you want the trailing punctuation or space, add . to the end of the pattern:
sentence.scan(/\b an? \s+ [a-z]+ ./imx)
# => ["a sentence.", "An egg ", "a word.", "A gee ", "a word."]

Regex for finding at most n consecutive patterns

Let's say our pattern is a regex for capital letters (but we could have a more complex pattern than searching for capitals).
To find at least n consecutive patterns (in this case, the pattern we are looking for is simply a capital letter), we can do this:
(Using Ruby)
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ"]
at_least_3_capitals = somestring.scan(/[A-Z]{3}[A-Z]*/)
=> ["ABC", "XYZ"]
However, how do I search for at most n consecutive patterns, for example, at most one consecutive capital letter:
matches = somestring.scan(/ ??? /)
=> [" deFgHij kLmN pQrS ", " abcdEf"]
Detailed strategy
I read that I need to negate the "at least" regex, by turning it into a DFA, negating the accept states, (then converting it back to NFA, though we can leave it as it is) so to write it as a regex. If we think of encountering our pattern as receiving a '1' and not receiving the pattern as receiving a '0', we can draw a simple DFA diagram (where n=1, we want at most one of our pattern):
Specifically, I was wondering how this becomes a regex. Generally, I hope to find how to find "at most" with regex, as my regex skills feel stunted with "at least" alone.
Trip Hazards - not quite the right solution in spirit
Note that this question is not a duplicate of this post, as using the accepted methodology there would give:
somestring.scan(/[A-Z]{2}[A-Z]*(.*)[A-Z]{2}[A-Z]*/)
=> [[" deFgHij kLmN pQrS X"]]
Which is not what the DFA shows, not just because it misses the second sought match - more importantly that it includes the 'X', which it should not, as 'X' is followed by another capital, and from the DFA we see that a capital which is followed by another capital is not an accept state.
You could suggest
somestring.split(/[A-Z]{2}[A-Z]*/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]
(Thanks to Rubber Duck)
but I still want to know how to find at most n occurrences using regex alone. (For knowledge!)
Why your attempt does not work
There are a few problems with your current attempt.
The reason that X is part of the match is that .* is greedy and consumes as much as possible - hence, leaving only the required two capital letters to be matched by the trailing bit. This could be fixed with a non-greedy quantifier.
The reason why you don't get the second match is twofold. First, you require two trailing capital letters to be there, but instead there is the end of the string. Second, matches cannot overlap. The first match includes at least two trailing capital letters, but the second would need to match these again at the start which is not possible.
There are more hidden problems: try an input with four consecutive capital letters - it can give you an empty match (provided you use the non-greedy quantifier - the greedy one has even worse problems).
Fixing all of these with the current approach is hard (I tried and failed - check the edit history of this post if you want to see my attempt until I decided to scrap this approach altogether). So let's try something else!
Looking for another solution
What is it that we want to match? Disregarding the edge cases, where the match starts at the beginning of the string or ends at the end of the string, we want to match:
(non-caps) 1 cap (non-caps) 1 cap (non-caps) ....
This is ideal for Jeffrey Friedl's unrolling-the-loop technique, which looks like
[^A-Z]+(?:[A-Z][^A-Z]+)*
Now what about the edge cases? We can phrase them like this:
We want to allow a single capital letter at the beginning of the match, only if it's at the beginning of the string.
We want to allow a single capital letter at the end of the match, only if it's at the end of the string.
To add these to our pattern, we simply group a capital letter with the appropriate anchor and mark both together as optional:
(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?
Now it's really working. And even better, we don't need capturing any more!
Generalizing the solution
This solution is easily generalized to the case of "at most n consecutive capital letters", by changing each [A-Z] to [A-Z]{1,n} and thereby allowing up to n capital letters where there is only one allowed so far.
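A minimal sketch of that generalization, assuming a hypothetical helper named at_most_caps that interpolates n into the pattern above. Note that ^ and $ are line anchors in Ruby; \A and \z would anchor to the whole string instead.

```ruby
# Hypothetical helper: build the unrolled-loop pattern for runs of
# at most n capital letters.
def at_most_caps(n)
  cap = "[A-Z]{1,#{n}}"
  /(?:^#{cap})?[^A-Z]+(?:#{cap}[^A-Z]+)*(?:#{cap}$)?/
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
middle_matches = somestring.scan(at_most_caps(1))
p middle_matches
# => [" deFgHij kLmN pQrS ", " abcdEf"]

# Edge cases: a single capital at the very start and at the very end.
edge_matches = "Abc dEf G".scan(at_most_caps(1))
p edge_matches
# => ["Abc dEf G"]
```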
tl;dr
To match words containing at most N PATTERNs, use the regex
/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
For example, to match words containing at most 1 capital letter,
/\b(?:\w(?:(?<![A-Z])|(?!(?:[A-Z]){1,})))+\b/
This works for multi-character patterns too.
Clarification Needed
I'm afraid your examples may cause confusion. Let's add a few words:
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Now, rerunning your at-least-2-capitals regex returns
at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ", "DC", "DC", "TT"]
Note how complete words are not captured! Are you sure this is what you wanted? I ask, of course, because in your latter examples, your at-most-1-capital regex returns complete words, instead of just the capital letters being captured.
Solution
Here's the solution either way.
First, for matching just patterns (and not entire words, as consistent with your initial examples), here's a regex for at-most-N-PATTERNs:
/(?<!PATTERN)(?!(?:PATTERN){N+1,})(?:PATTERN)+/
For example, the at-most-1-capitals regex would be
/(?<![A-Z])(?!(?:[A-Z]){2,})(?:[A-Z])+/
and returns
=> ["F", "H", "L", "N", "Q", "S", "E", "C", "I", "C", "I", "T"]
To further exemplify, the at-most-2-capitals regex returns
=> ["F", "H", "L", "N", "Q", "S", "E", "C", "DC", "I", "C", "I", "DC", "T", "TT"]
Finally, if you wanted to match entire words that contained at most a certain number of consecutive patterns, then here's a whole different approach:
/\b(?:\w(?:(?<![A-Z])|(?![A-Z]{1,})))+\b/
This returns
["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]
The general form is
/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
You can see all these examples at http://ideone.com/hImmZr.
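Here is a rough sketch of the whole-word general form wrapped in a helper. The name words_with_at_most and its parameters are hypothetical, and the pattern is passed as regex source text so it can be interpolated.

```ruby
# Hypothetical helper: match whole words that contain no run of more than
# n consecutive occurrences of `pattern` (given as regex source text).
def words_with_at_most(n, pattern = "[A-Z]")
  /\b(?:\w(?:(?<!#{pattern})|(?!(?:#{pattern}){#{n},})))+\b/
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps " \
             "mIxedCaps mIxeDCaps T TT t tt"

at_most_1 = somestring.scan(words_with_at_most(1))
p at_most_1
# => ["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]

at_most_2 = somestring.scan(words_with_at_most(2))
p at_most_2
# => ["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mixeDCaps",
#     "mIxedCaps", "mIxeDCaps", "T", "TT", "t", "tt"]
```

Raising n from 1 to 2 lets mixeDCaps, mIxeDCaps and TT through, while ABC and XYZ stay excluded.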
To find "at most" with a regex, you use the quantifier {1,n} (possibly preceded by a negative lookbehind and followed by a negative lookahead), so it seems that what you want is:
irb(main):006:0> somestring.scan(/[A-Z]{1,2}/)
=> ["AB", "C", "F", "H", "L", "N", "Q", "S", "XY", "Z", "E"]
or
irb(main):007:0> somestring.scan(/(?<![A-Z])[A-Z]{1,2}(?![A-Z])/)
=> ["F", "H", "L", "N", "Q", "S", "E"]
EDIT: if the OP still wants "the longest strings not including more than two uppercase letters", they can use:
irb(main):025:0> somestring.scan(/[^A-Z]+(?:[A-Z]{1,2}[^A-Z]+)*/)
=> [" deFgHij kLmN pQrS ", " abcdEf"]
(but that regex possibly won't match in the beginning and the end of the string)
It seems that
irb(main):026:0> somestring.split(/[A-Z]{3,}/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]
would be better for that.

Resources