Ruby Regular expression not matching properly - ruby

I am trying to create a regex to find words that contain any vowel.
So far I have tried this:
/(.*?\S[aeiou].*?[\s|\.])/i
but I have not used regex much, so it's not working properly.
For example, if I input "test is 1234 and sky fly test1234"
it should match test, is, and, test1234, but it shows
test, is,1234 and
If I put in something else, I get different output.

Alternatively you can also do something like:
"test is 1234 and sky fly test1234".split.find_all { |a| a =~ /[aeiou]/ }
# => ["test", "is", "and", "test1234"]

You could use the below regex.
\S*[aeiou]\S*
\S* matches zero or more non-space characters.
or
\w*[aeiou]\w*

It will solve:
\b\w*[aeiou]+\w*\b
https://www.debuggex.com/r/O-fU394iC5ErcSs7
or you can substitute \w by \S
\b\S*[aeiou]+\S*\b
https://www.debuggex.com/r/RNE6Y6q1q5yPJbe-
\b - a word boundary
\w - same as [_a-zA-Z0-9]
\S - a non-whitespace character

Try this:
\b\w*[aeiou]\w*\b
\b denotes a word boundary, so this regexp matches a word boundary, zero or more word characters, a vowel, zero or more word characters, and another word boundary.
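To check the word-boundary pattern from the answers against the sample input, here is a minimal sketch using String#scan. The method name words_with_vowels is made up for illustration, and the i flag is an assumption that uppercase vowels should count as well.

```ruby
# Hypothetical helper: returns every word that contains at least one vowel.
# The pattern \b\w*[aeiou]\w*\b comes from the answers; the i flag is an added assumption.
def words_with_vowels(text)
  text.scan(/\b\w*[aeiou]\w*\b/i)
end

p words_with_vowels("test is 1234 and sky fly test1234")
# => ["test", "is", "and", "test1234"]
```

Because the pattern has no capture groups, scan returns plain strings rather than nested arrays.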

Related

Regex: Match all hyphens or underscores not at the beginning or the end of the string

I am writing some code that needs to convert a string to camel case. However, I want to allow any _ or - at the beginning of the code.
I have had success matching up an _ character using the regex here:
^(?!_)(\w+)_(\w+)(?<!_)$
when the inputs are:
pro_gamer #matched
#ignored
_proto
proto_
__proto
proto__
__proto__
#matched as nerd_godess_of, skyrim
nerd_godess_of_skyrim
I recursively apply my method on the first match if it looks like nerd_godess_of.
I am having trouble adding - matches to the same regex. I assumed that just adding a - to the mix like this would work:
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
and it matches like this:
super-mario #matched
eslint-path #matched
eslint-global-path #NOT MATCHED.
I would like to understand why the regex fails to match the last case given that it worked correctly for the _.
The (almost) full set of test inputs can be found here
The regex
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
cannot match "eslint-global-path" because the pattern is anchored at both ends and allows only one [_-] separator between two runs of word characters. It reads: "Match the beginning of the line, not followed by a hyphen or underscore, then one or more word characters (which include underscores) in capture group 1, a hyphen or underscore, and then one or more word characters in capture group 2. Lastly, the end of the line must not be preceded by a hyphen or underscore."
The underscore case only appeared to work because an underscore (but not a hyphen) is a word (\w) character: in "nerd_godess_of_skyrim" the first \w+ absorbs the extra underscores, whereas \w+ can never cross a hyphen. In general, rather than using \w, you might want to use \p{Alpha} or \p{Alnum} (or POSIX [[:alpha:]] or [[:alnum:]]).
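The difference between the underscore and hyphen cases can be reproduced with a short sketch. The two patterns are copied from the question; the constant names and the helper split_once are hypothetical, added only for illustration.

```ruby
# The two patterns from the question, unchanged.
UNDERSCORE_ONLY = /^(?!_)(\w+)_(\w+)(?<!_)$/
WITH_HYPHEN     = /^(?![_-])(\w+)[_-](\w+)(?<![_-])$/

# Hypothetical helper: returns the two captures, or nil when there is no match.
def split_once(pattern, str)
  m = pattern.match(str)
  m && m.captures
end

# \w+ swallows the inner underscores, so the whole string still matches.
p split_once(UNDERSCORE_ONLY, "nerd_godess_of_skyrim") # => ["nerd_godess_of", "skyrim"]
# One hyphen is fine: it is the single [_-] separator.
p split_once(WITH_HYPHEN, "super-mario")               # => ["super", "mario"]
# Two hyphens fail: \w+ cannot cross the second hyphen to reach the end anchor.
p split_once(WITH_HYPHEN, "eslint-global-path")        # => nil
```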
Try this.
r = /
  (?<=       # begin a positive lookbehind
    [^_-]    # match a character other than an underscore or hyphen
  )          # end positive lookbehind
  (          # begin capture group 1
    (?:      # begin a non-capture group
      -+     # match one or more hyphens
      |      # or
      _+     # match one or more underscores
    )        # end non-capture group
    [^_-]    # match any character other than an underscore or hyphen
  )          # end capture group 1
  /x         # free-spacing regex definition mode
'_cats_have--nine_lives--'.gsub(r) { |s| s[-1].upcase }
#=> "_catsHaveNineLives--"
This regex is conventionally written as follows.
r = /(?<=[^_-])((?:-+|_+)[^_-])/
If all the letters are lower case one could alternatively write
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/).
map(&:capitalize).join
#=> "_catsHaveNineLives--"
where
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/)
#=> ["_cats", "have", "nine", "lives--"]
(?=[^_-]) is a positive lookahead that requires the characters on which the split is made to be followed by a character other than an underscore or hyphen.
You can try the regex
^(?=[^-_])(\w+[-_]\w*)+(?=[^-_])\w$
see the demo here.
I switched _- to -_ so that - cannot be read as a range operator, as in a-z (a hyphen placed first or last in a character class is literal either way).
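For the camel-case conversion itself, the lookbehind approach from the answers can be wrapped in a small method. This is only a sketch: the name camelize_inner is hypothetical, and it assumes that leading and trailing runs of _ or - should be left untouched, as the question describes.

```ruby
# Hypothetical helper built on the lookbehind regex from the answers.
# Only separators that sit between two non-separator characters are removed;
# the character after the separator is captured and upcased.
def camelize_inner(str)
  str.gsub(/(?<=[^_-])(?:-+|_+)([^_-])/) { Regexp.last_match(1).upcase }
end

%w[pro_gamer eslint-global-path nerd_godess_of_skyrim __proto__ _proto].each do |s|
  puts "#{s} -> #{camelize_inner(s)}"
end
```

Since gsub replaces every occurrence in one pass, no recursion on the first capture is needed.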

Split string on capital letters, but not if preceded by whitespace

I have a string that looks like
"AaaBbbCcc DddEee"
I'm splitting it with
my_string.scan(/[A-Z][a-z]+/)
and the result is
["Aaa", "Bbb", "Ccc", "Ddd", "Eee"]
What I'd like to achieve is to not split the string if the capital letter is preceded by a white space, so the result would look like
["Aaa", "Bbb", "Ccc Ddd", "Eee"]
my_string.split(/(?<!\s)(?=[A-Z])/)
This matches positions that are not preceded by a whitespace (negative lookbehind - (?<!\s)) and are followed by a capital letter (positive lookahead - (?=[A-Z])).
If you do not need to split or if the number of spaces in between the desired matches can be different, you may use your own approach and match additionally zero or more sequences of whitespace(s) + [A-Z][a-z]+ by adding (?:\s+[A-Z][a-z]+)* subpattern:
my_string.scan(/[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*/)
See the Ruby demo
To shorten it a bit, you may build the regex dynamically (see demo here):
my_string = 'AaaBbbCcc DddEee'
block = "[A-Z][a-z]+"
puts my_string.scan(/#{block}(?:\s+#{block})*/)
And here is a Unicode-friendly version of the above regex (online demo):
my_string.scan(/\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*/)
where \p{Lu} matches any uppercase letter and \p{Ll} matches any lowercase letter.
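A quick way to compare the split and scan approaches is to run both on the same inputs. The method names below are hypothetical; the patterns are the ones given in the answer.

```ruby
# Hypothetical helpers wrapping the two approaches from the answer above.
def split_on_capitals(str)
  # Split before a capital letter unless whitespace precedes it.
  str.split(/(?<!\s)(?=[A-Z])/)
end

def scan_capitalized_runs(str)
  # Match a capitalized word plus any whitespace-separated capitalized words after it.
  str.scan(/\p{Lu}\p{Ll}+(?:\s+\p{Lu}\p{Ll}+)*/)
end

p split_on_capitals("AaaBbbCcc DddEee")      # => ["Aaa", "Bbb", "Ccc Ddd", "Eee"]
p scan_capitalized_runs("AaaBbbCcc DddEee")  # => ["Aaa", "Bbb", "Ccc Ddd", "Eee"]
p scan_capitalized_runs("AaaBbb  CccDdd")    # => ["Aaa", "Bbb  Ccc", "Ddd"]
```

The two differ on text that does not fit the capital-plus-lowercase shape: split keeps it inside the pieces, while scan silently drops it.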

Regex to find strings with only letters or numbers or both

I am searching for strings with only letters or numbers or both. How could I write a regex for that?
You can use the following regex to check whether the string contains only letters and/or numbers:
^[a-zA-Z0-9]+$
Explanation
^: Starts with
[]: Character class
a-zA-Z: Matches any ASCII letter
0-9: Matches any digit
+: Matches the preceding character class one or more times
$: Ends with
RegEx101 Demo
"abc&#*(2743438" !~ /[^a-z0-9]/i # => false
"abc2743438" !~ /[^a-z0-9]/i # => true
This example avoids the line anchors ^ and $, which in Ruby match at every line break and may therefore present a security risk. It's better to use \A and \z; in Rails, the format validator accepts ^ and $ only if you add the :multiline => true option.
Only letters and numbers:
/\A[a-zA-Z0-9]+\z/
Or if you want to allow - and _ chars as well:
/\A[a-zA-Z0-9_\-]+\z/
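To see why the choice of anchors matters, here is a small sketch comparing the two on input that contains a newline. The method names are hypothetical, and String#match? assumes Ruby 2.4 or later.

```ruby
# Hypothetical validators: same character class, different anchors.
def alnum_line?(str)
  str.match?(/^[a-zA-Z0-9]+$/)    # ^ and $ match at line boundaries in Ruby
end

def alnum_string?(str)
  str.match?(/\A[a-zA-Z0-9]+\z/)  # \A and \z match only at the ends of the string
end

input = "abc123\n<script>"
p alnum_line?(input)       # => true, because the first line alone satisfies the pattern
p alnum_string?(input)     # => false
p alnum_string?("abc123")  # => true
```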

Regex find 'a' or 'an' in sentence in Ruby

I am a beginner in regex. I thought I would complete this without help but couldn't.
I want to find article-word pairs in the following sentences (where the article must be A or An):
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
I used this regex pattern:
/[(An)|(an)|a|A]\s+\w+[\s|.]/
Captured pairs are:
'a sentence.', 'n egg ', 'a word.', 'A gee ', 'a word.', 'n is '.
The above pattern couldn't capture An egg fully. More strangely, it captured 'n is ' in Ocean is.
What would be the correct pattern to extract them?
Add a word boundary:
/\b(an?)\s+\w+/i
Edit: (n mustn't be capital)
/\b([aA]n?)\s+\w+/
s = "This is a sentence. An egg is a word. A gee another word.\nLast line is a word. Ocean is very big."
s.scan /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
# => [
# [0] "a sentence",
# [1] "An egg",
# [2] "a word",
# [3] "A gee",
# [4] "a word"
# ]
Here we go: /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
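First is a lookbehind that prevents matching "an is" in "Ocean is". Then we look for a (maybe capital), possibly followed by "n", then whitespace and the word itself. The final m is Ruby's multiline flag, which lets . match newlines; this pattern contains no ., so the flag is optional here.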
To avoid using lookbehind, one may change the regexp to:
/\b[Aa]n?\s+[A-Za-z]+/m
UPD: One should avoid using \w here, since \w matches [A-Za-z0-9_]; note especially the underscore.
Try simplifying to \b(An|an|a|A) \w+\b.
I'd use a very simple pattern, along with scan to find all occurrences:
sentence = <<EOT
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
EOT
sentence.scan(/\b an? \s+ [a-z]+/imx)
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
I'm using the x flag to improve the readability of the pattern.
The pattern breaks down to:
\b: a word-boundary so only "a" or "an" match. (It's case insensitive.)
an?: matches "a" or "an".
\s+: matches one or more white-spaces.
[a-z]+: matches consecutive runs of letters only. This is significant because any pattern using the \w character-class would also match 0..9 and "_" (underscore). Your sample doesn't contain those, but any text containing those characters would be likely to give you bad results.
The i flag means ignore case. The m flag makes . match newlines too, so the text is treated as a single line; normally . stops at line ends. The x flag means that whitespace in the pattern is not significant, requiring \s to mark where it should be.
If you want the trailing punctuation or space, add . to the end of the pattern:
sentence.scan(/\b an? \s+ [a-z]+ ./imx)
# => ["a sentence.", "An egg ", "a word.", "A gee ", "a word."]
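The stray 'n is ' match in the question comes from the square brackets: [(An)|(an)|a|A] is a character class that matches one single character out of (, ), |, A, a and n, not the alternatives "An" or "an". The sketch below contrasts it with the word-boundary pattern from the answers; the constant names are hypothetical, and Ruby may print a harmless warning about duplicated characters in the class.

```ruby
TEXT = "This is a sentence. An egg is a word. A gee another word.\n" \
       "Last line is a word. Ocean is very big."

# Pattern from the question: the brackets make a one-character class.
ORIGINAL = /[(An)|(an)|a|A]\s+\w+[\s|.]/
# Word-boundary pattern from the answers.
FIXED = /\b[Aa]n?\s+[A-Za-z]+/

p TEXT.scan(ORIGINAL).include?("n is ")  # => true (the stray match from "Ocean is")
p TEXT.scan(FIXED)
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
```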

Regex: don't match if string contains whitespace

I can't seem to figure out the regex pattern for matching a string only if it doesn't contain whitespace. For example
"this has whitespace".match(/some_pattern/)
should return nil but
"nowhitespace".match(/some_pattern/)
should return the MatchData with the entire string. Can anyone suggest a solution for the above?
In Ruby I think it would be
/^\S*$/
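This means "start of line, match any number of non-whitespace characters, end of line". Since ^ and $ are line anchors in Ruby, use /\A\S*\z/ if the string may contain newlines.
You could always search for spaces, and then negate the result: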
"str".match(/\s/).nil?
>> "this has whitespace".match(/^\S*$/)
=> nil
>> "nospaces".match(/^\S*$/)
=> #<MatchData "nospaces">
^ = beginning of line
\S = non-whitespace character, * = 0 or more
$ = end of line
Not sure you can do it in one pattern, but you can do something like:
"string".match(/pattern/) unless "string".match(/\s/)
"nowhitespace".match(/^[^\s]*$/)
You want:
/^\S*$/
That says "match the beginning of the line, then zero or more non-whitespace characters, then the end of the line." The convention for pre-defined character classes is that a lowercase letter refers to a class, while an uppercase letter refers to its negation. Thus, \s refers to whitespace characters, while \S refers to non-whitespace.
str.match(/^\S*some_pattern\S*$/)
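One caveat with the patterns above: in Ruby, ^ and $ anchor to lines, so a string that contains a newline can still match if any one of its lines is free of whitespace. A minimal sketch using \A and \z instead follows; the method name match_without_whitespace is hypothetical.

```ruby
# Hypothetical helper: returns MatchData for the whole string,
# or nil if the string contains any whitespace (newlines included).
def match_without_whitespace(str)
  str.match(/\A\S*\z/)
end

p match_without_whitespace("nowhitespace")         # => #<MatchData "nowhitespace">
p match_without_whitespace("this has whitespace")  # => nil
# With line anchors, a whitespace-free line inside a longer string still matches:
p "has space\nnospace".match(/^\S*$/)              # => #<MatchData "nospace">
p match_without_whitespace("has space\nnospace")   # => nil
```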
