Ruby Regular expressions (regex): character appear only once at most - ruby

Suppose I want to make sure a string x equals any combination of abcd (each character appearing one or zero times-->each character should not repeat, but the combination may appear in any order)
valid ex: bc .. abcd ... bcad ... b... d .. dc
invalid ex. abcdd, cc, bbbb, abcde (ofcourse)
my effort:
I tried various techniques:
the closest I came was
x =~ ^(((a)?(b)?(c)?(d)?))$
but this wont work if I do not type them in the same order as i have written:
works for: ab, acd, abcd, a, d, c
wont work for: bcda, cb, da (anything that is not in the above order)
you can test your solutions here : http://rubular.com/r/wCpD355bub
PS: the characters may not be in alphabetical order, it could be u c e t

If you can use things besides regexes, you can try:
str.chars.uniq.length == str.length && str.match(/^[a-d]+$/)
The general idea here is that you just strip any duplicated characters from the string, and if the length of the uniq array is not equal to the length of the source string, you have a duplicated character in the string. The regex then enforces the character set.
This can probably be improved, but it's pretty straightforward. It does create a couple of extra arrays, so you might want a different approach if this needs to be used in a performance-critical location.
If you want to stick to regexes, you could use:
str.match(/^[a-d]+$/) && !str.match(/([a-d]).*\1/)
That'll basically check that the string only contains the allowed characters, and that those characters are never repeated.

This is really not what regular expressions are meant to do, but if you really really want to.
Here is a regex that satisfies the conditions.
^([a-d])(?!(\1))([a-d])?(?!(\1|\3))([a-d])?(?!(\1|\3|\5))([a-d])?(?!(\1|\3|\5|\7))$
basically it goes through each character, making the group, then makes sure that that group isn't matched. Then checks the next character, and makes sure that group and the previous groups don't match.

You can reverse it (match the condition that would make it fail)
re = /^ # start of line
(?=.*([a-d]).*\1) # match if a letter appears more than once
| # or
(?=.*[^a-d]) # match if a non abcd char appears
/x
puts 'fail' if %w{bc abcd bcad b d dc}.any?{|s| s =~ re}
puts 'fail' unless %w{abcdd cc bbbb abcde}.all?{|s| s =~ re}

I don't think regexes are well suited to this problem, so here is another non-regex solution. It's recursive:
def match_chars_no_more_than_once(characters, string)
return true if string.empty?
if characters.index(string[0])
match_chars_no_more_than_once(characters.sub(string[0],''), string[1..-1])
else
false
end
end
%w{bc bdac hello acbbd cdda}.each do |string|
p [string, match_chars_no_more_than_once('abcd', string)]
end
Output:
["bc", true]
["bdac", true]
["hello", false]
["acbbd", false]
["cdda", false]

Related

How to use regular expressions to match numbers with some exceptions

I need to match numbers in groups of 5, from 1 to 5, with the following exceptions:
Numbers can't include zeros
Numbers can't be like 11111, 22222 and so on.
Numbers can't be like 12345 or 54321
Some examples of valid numbers:
14252, 45121, 43412, 51321 ...
So far I got an expression to group the numbers and do not allow zeros.
/[1-5]{5}/
But I'm having some trouble to handle the second and third exceptions. I tried unsuccessfully to use a negative lookahead to disallow a match if I have a pattern of repeated numbers.
?!11111|?!22222
I'm trying with this expression:
((?!11111)[1-5]{5}?)
How can I write regular expressions to not match certain patterns?
I will eventually change it to not match any other sequence of numbers.
First off, you don't have to cram everything into one regex. Regexes are already complicated, if you can do it in multiple regexes, that will often make things much simpler and allow for more flexible code. For example, you can customize the error message based on which condition failed. Usually you only need to fold multiple regexes together for performance reasons, and there are tools to do that automatically.
So far I got an expression to group the numbers and do not allow zeros.
/[1-5]{5}/
Careful, you have to anchor at both ends that else it will accept any string that contains a run of 5 of 1-5.
/\A[1-5]{5}\z/
Numbers can't be like 11111, 22222 and so on.
Use a capture within the regex to accomplish this. Capture the first number, then see if there's four more. () to capture and \1 to refer to what was captured.
/\A([1-5])\1{4}\z/'
Numbers can't be like 12345 or 54321
/\A(?:12345|54321)\z/
Here's a solution that does not use a regular expression. I understand we are to determine if: a) the string contains five characters; b) each character equals '1', '2', '3', '4' or '5'; c) the string contains at least two different characters; and d) the string is neither '12345' nor '54321'. We can do that as follows.
def is_ok?(str)
str.size == 5 && # five characters
(str.chars - ['1','2','3','4','5']).empty? && # only the digits '1'-'5'
str.squeeze.size > 1 && # not all the same character
str != '12345' && # not an increasing sequence
str != '54321' # not a decreasing sequence
end
is_ok? '12543' #=> true
is_ok? '12043' #=> false
is_ok? '12643' #=> false
is_ok? '22222' #=> false
is_ok? '12345' #=> false
is_ok? '54321' #=> false
You have the right idea using negative lookaheads, just the syntax was a little off. This works for me:
\A(?!11111|22222|33333|44444|55555|12345|54321)[1-5]{5}\z
How about this?
^(?!([1-5])\1{4})(?!54321)(?!12345)[1-5]{5}$

At which position does the regex fail?

I need a very simple string validator that would show where is first symbol not corresponding to the desired format. I want to use regex but in this case I have to find the place where the string stops corresponding to the expression and I can't find a method that would do that.
(It's got to be a fairly simple method... maybe there isn't one?)
For example if I have regex:
/^Q+E+R+$/
with string:
"QQQQEEE2ER"
The desired result should be 7
An idea: what you can do is to tokenize your pattern and write it with optional nested capturing groups:
^(Q+(E+(R+($)?)?)?)?
Then you only need to count the number of capture groups you obtain to know where the regex engine stops in the pattern and you can determine the offset of the match end in the string with the whole match length.
As #zx81 notices it in his comment, if one of the elements can match the next element (example Q can match the element E), things become different.
Let's say that Q is \w (and can match E and R). For the string QQQEEERRR the precedent pattern will give only one capturing group (the greedy \w+ matches all) when ^(\w+)(E+)(R+)$ will give three groups: QQQEE, E, RRR
To obtain the same result you need to add an alternation:
^((?:\w+(?=E)|\w+)(E+(R+($)?)?)?)?
In the alternation, the case where E exists must be tested first, and only if this branch fails (with the lookahead), then the other branch where E doesn't exist is used.
Thus the full pattern can be rewritten like this to deal with this specific case:
^((?:Q+(?=E)|Q+)((?:E+(?=R)|E+)((?:R+(?=$)|R+)($)?)?)?)?
Perhaps could you take a look to the gem amatch too.
This is an interesting task that can be accomplished with a neat regex trick:
^(?:(?=(Q+)))?(?:(?=(Q+E+)))?(?:(?=(Q+E+R+)))?(?:(?=(Q+E+R+$)))?
We have four optional lookaheads checking various parts of the pattern and capturing the partial matches to Groups 1, 2, 3 and 4 incrementally.
Group 1 contains Q+ if it can be matched, in your example QQQQ.
Group 2 contains Q+E+ if it can be matched, in your example EEE.
Group 3 contains Q+E+R+ if it can be matched, in your example nil.
Group 3 contains Q+E+R+$ if it can be matched, in your example nil.
In your code, check which is the last Group that is set by testing !$1.nil?, !$2.nil? and so on.
The last one set gives you the length that is matchable, so in your example $2.length gives you the 7 you wanted.
Incidentally, the fact that Group 2 is the last one set also tells you that we fail on R+.
For your example, you could do the following.
Code
Change your regex from:
/^Q+E+R+$/
to
R = /^(Q*)(E*)(R*)/
and then apply the following method to the string:
def nbr_matched_chars(str)
str.scan(R).flatten.reduce(0) {|t,e| return t if e.nil?; t+e.size }
end
str matches the original regex if and only if nbr_matched_chars(str) == str.size.
Examples
nbr_matched_chars("QQQQEEE2ER") #=> 7
nbr_matched_chars("QQQQEEEERR") #=> 10 (= "QQQQEEEERR".size)
nbr_matched_chars("QQAQQEEEER") #=> 2
Explanation
To see why this [evidently :-)] works, we can look at the results of invoking String#scan, followed by Array#flatten:
"QQQQEEE2ER".scan(r).flatten #=> ["QQQQ", "EEE" , nil ]
"QQQQEEEERR".scan(r).flatten #=> ["QQQQ", "EEEE", "RR"]
"QQAQQEEEER".scan(r).flatten #=> ["QQ" , nil , nil ]

Using Regexp to check whether a string starts with a consonant

Is there a better way to write the following regular expression in Ruby? The first regex matches a string that begins with a (lower case) consonant, the second with a vowel.
I'm trying to figure out if there's a way to write a regular expression that matches the negative of the second expression, versus writing the first expression with several ranges.
string =~ /\A[b-df-hj-np-tv-z]/
string =~ /\A[aeiou]/
The statement
$string =~ /\A[^aeiou]/
will test whether the string starts with a non-vowel character, which includes digits, punctuation, whitespace and control characters. That is fine if you know beforehand that the string begins with a letter, but to check that it starts with a consonant you can use forward look-ahead to test that it starts with both a letter and a non-vowel, like this
$string =~ /\A(?=[^aeiou])(?=[a-z])/i
To match an arbitrary number of consonants, you can use the sub-expression (?i:(?![aeiou])[a-z]) to match a consonant. It is atomic, so you can put a repetition count like {3} right after it. For example, this program finds all the strings in a list that contain three consonants in a row
list = %w/ aab bybt xeix axei AAsE SAEE eAAs xxsa Xxsr /
puts list.select { |word| word =~ /\A(?i:(?![aeiou])[a-z]){3}/ }
output
bybt
xxsa
Xxsr
I modified the answer provided by #Alexander Cherednichenko in order to get rid of the if statements.
/^[^aeiou\W]/i.match(s) != nil
If you want to catch a string that doesn't start with vowels, but only starts with consonants you can use this code below. It returns true if a string starts with any letter other than A, E, I, O, U. s is any string we give to a function
if /^[^aeiou\W]/i.match(s) == nil
return false
else
return true
end
i added at the end to make regular expression case insensitive.
\W is used to catch any non-word character, for example if a string starts with a digit like: "1something"
[^aeiou] means a range of character except a e i o u
And we put ^ at the beginning before [ to indicate that the following range [^aeiou\W] if for the 1st character
Note that ^[^aeiou\W] pattern is not correct because it also matches a line that starts with a digit, or underscore. Borodin's solution is working well, but there is one more possible solution without lookaheads, based on character class subtraction (more here) and using the more contemporary Regexp#match?:
/\A[a-z&&[^aeiou]]/i.match?(word)
See the Rubular demo.
Details
\A - start of a string (^ in Ruby is start of any line)
[a-z&&[^aeiou]] - an a-z character range matching any ASCII letter (/i flag makes it case insensitive) except for the aeiou chars.
See the Ruby demo:
test = %w/ 1word _word ball area programming /
puts test.select { |w| /\A[a-z&&[^aeiou]]/i.match?(w) }
# => ['ball', 'programming']

How to know if a match is adjacent to the previous match

In a construction like
string.scan(regex){...}
or
string.gsub(regex){...}
how can check if the match for a loop cycle is adjacent to the previous one in the original string? For example, in
"abaabcaaab".scan(/a+b/){|match|
...
continued = ...
...
}
there will be three matches "ab", "aab", and "aaab". During each cycle, I want them to have the variable continued to be false, true, and false respectively because "ab" is the first match cycle, "aab" is adjacent to it, and "c" interrupts before the next match "aaab".
"ab" #=> continued = false
"aab" #=> continued = true
"aaab" #=> continued = false
Is there an anchor in origuruma that refers to the end of the previous matching position? If so, that may be used in the regex. If not, I probably need to use things like MatchData#offset. and do some calculation in the loop.
By the way, what is \G in origuruma regex? I had the impression that it might be the anchor that I want, but I am not sure what it is.
I don't believe the offset data is available using those methods. You'll probably have to use Regexp#match, passing along the location each time. The returned MatchData object contains all the info you need to do any substitutions etc too.
Of course, you'll have to be careful if you are incrementing offsets in combination with doing string substitutions, if the length of the replacement is not the same as the length of the match. A common pattern here is to walk the string backwards, but I don't think you'll be able to follow that pattern with these methods, so you'll need to adjust the offsets.
EDIT | Actually, you would be able to walk the string backwards, if you do the replacement in a completely separate step. First find everything you need to replace, along with the offsets. Next, iterate that list in reverse order, doing your substitutions.
StringScanner would be well suited to this task: http://corelib.rubyonrails.org/classes/StringScanner.html
require 'strscan'
s = StringScanner.new('abaabcaaab')
begin
puts s.pos
s.scan_until(/a+b/)
puts s.matched
end while !s.matched.nil?
outputs
0
ab
2
aab
5
aaab
10
nil
So you could then just keep track of the length of the last match and the position and do the math to see if they are adjacent.

Regular expression to match my pattern of words, wild chars

can you help me with this:
I want a regular expression for my Ruby program to match a word with the below pattern
Pattern has
List of letters ( For example. ABCC => 1 A, 1 B, 2 C )
N Wild Card Charaters ( N can be 0 or 1 or 2)
A fixed word (for example “XY”).
Rules:
Regarding the List of letters, it should match words with
a. 0 or 1 A
b. 0 or 1 B
c. 0 or 1 or 2 C
Based on the value of N, there can be 0 or 1 or 2 wild chars
Fixed word is always in the order it is given.
The combination of all these can be in any order and should match words like below
ABWXY ( if wild char = 1)
BAXY
CXYCB
But not words with 2 A’s or 2 B’s
I am using the pattern like ^[ABCC]*.XY$
But it looks for words with more than 1 A, or 1 B or 2 C's and also looks for words which end with XY, I want all words which have XY in any place and letters and wild chars in any postion.
If it HAS to be a regex, the following could be used:
if subject =~
/^ # start of string
(?!(?:[^A]*A){2}) # assert that there are less than two As
(?!(?:[^B]*B){2}) # and less than two Bs
(?!(?:[^C]*C){3}) # and less than three Cs
(?!(?:[ABCXY]*[^ABCXY]){3}) # and less than three non-ABCXY characters
(?=.*XY) # and that XY is contained in the string.
/x
# Successful match
else
# Match attempt failed
end
This assumes that none of the characters A, B, C, X, or Y are allowed as wildcards.
I consider myself to be fairly good with regular expressions and I can't think of a way to do what you're asking. Regular expressions look for patterns and what you seem to want is quite a few different patterns. It might be more appropriate to in your case to write a function which splits the string into characters and count what you have so you can satisfy your criteria.
Just to give an example of your problem, a regex like /[abc]/ will match every single occurrence of a, b and c regardless of how many times those letters appear in the string. You can try /c{1,2}/ and it will match "c", "cc", and "ccc". It matches the last case because you have a pattern of 1 c and 2 c's in "ccc".
One thing I have found invaluable when developing and debugging regular expressions is rubular.com. Try some examples and I think you'll see what you're up against.
I don't know if this is really any help but it might help you choose a direction.
You need to break out your pattern properly. In regexp terms, [ABCC] means "any one of A, B or C" where the duplicate C is ignored. It's a set operator, not a grouping operator like () is.
What you seem to be describing is creating a regexp based on parameters. You can do this by passing a string to Regexp.new and using the result.
An example is roughly:
def match_for_options(options)
pattern = '^'
pattern << 'A' * options[:a] if (options[:a])
pattern << 'B' * options[:b] if (options[:b])
pattern << 'C' * options[:c] if (options[:c])
Regexp.new(pattern)
end
You'd use it something like this:
if (match_for_options(:a => 1, :c => 2).match('ACC'))
# ...
end
Since you want to allow these "elements" to appear in any order, you might be better off writing a bit of Ruby code that goes through the string from beginning to end and counts the number of As, Bs, and Cs, finds whether it contains your desired substring. If the number of As, Bs, and Cs, is in your desired limits, and it contains the desired substring, and its length (i.e. the number of characters) is equal to the length of the desired substring, plus # of As, plus # of Bs, plus # of Cs, plus at most N characters more than that, then the string is good, otherwise it is bad. Actually, to be careful, you should first search for your desired substring and then remove it from the original string, then count # of As, Bs, and Cs, because otherwise you may unintentionally count the As, Bs, and Cs that appear in your desired string, if there are any there.
You can do what you want with a regular expression, but it would be a long ugly regular expression. Why? Because you would need a separate "case" in the regular expression for each of the possible orders of the elements. For example, the regular expression "^ABC..XY$" will match any string beginning with "ABC" and ending with "XY" and having two wild card characters in the middle. But only in that order. If you want a regular expression for all possible orders, you'd need to list all of those orders in the regular expression, e.g. it would begin something like "^(ABC..XY|ACB..XY|BAC..XY|BCA..XY|" and go on from there, with about 5! = 120 different orders for that list of 5 elements, then you'd need more for the cases where there was no A, then more for cases where there was no B, etc. I think a regular expression is the wrong tool for the job here.

Resources