Regular expression to match my pattern of words, wild chars - ruby

Can you help me with this?
I want a regular expression for my Ruby program to match a word with the below pattern
Pattern has
List of letters ( For example. ABCC => 1 A, 1 B, 2 C )
N Wild Card Characters ( N can be 0 or 1 or 2)
A fixed word (for example “XY”).
Rules:
Regarding the List of letters, it should match words with
a. 0 or 1 A
b. 0 or 1 B
c. 0 or 1 or 2 C
Based on the value of N, there can be 0 or 1 or 2 wild chars
Fixed word is always in the order it is given.
The combination of all these can be in any order and should match words like below
ABWXY ( if wild char = 1)
BAXY
CXYCB
But not words with 2 A’s or 2 B’s
I am using the pattern like ^[ABCC]*.XY$
But it also matches words with more than one A, more than one B, or more than two Cs, and it only finds words that end with XY. I want all words that have XY anywhere, with the letters and wild chars in any position.

If it HAS to be a regex, the following could be used:
if subject =~
/^ # start of string
(?!(?:[^A]*A){2}) # assert that there are less than two As
(?!(?:[^B]*B){2}) # and less than two Bs
(?!(?:[^C]*C){3}) # and less than three Cs
(?!(?:[ABCXY]*[^ABCXY]){3}) # and less than three non-ABCXY characters
(?=.*XY) # and that XY is contained in the string.
/x
# Successful match
else
# Match attempt failed
end
This assumes that none of the characters A, B, C, X, or Y are allowed as wildcards.
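As a quick check, the same lookaheads can be stored in a constant and tried against the sample words from the question. The constant name LOOKAHEAD_PATTERN and the helper allowed? are made up for this sketch. Note that the pattern as written hard-codes a maximum of two wild cards.

```ruby
# Same lookaheads as above, kept in a constant (name chosen for illustration).
LOOKAHEAD_PATTERN = /^
  (?!(?:[^A]*A){2})            # fewer than two As
  (?!(?:[^B]*B){2})            # fewer than two Bs
  (?!(?:[^C]*C){3})            # fewer than three Cs
  (?!(?:[ABCXY]*[^ABCXY]){3})  # fewer than three non-ABCXY characters
  (?=.*XY)                     # XY appears somewhere
/x

# Hypothetical helper: true when the word satisfies every lookahead.
def allowed?(word)
  LOOKAHEAD_PATTERN.match?(word)
end

%w[ABWXY BAXY CXYCB AABXY ABWWWXY ABC].each do |word|
  puts "#{word}: #{allowed?(word)}"
end
```

The first three words are accepted; AABXY (two As), ABWWWXY (three wild cards) and ABC (no XY) are rejected.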

I consider myself to be fairly good with regular expressions and I can't think of a way to do what you're asking. Regular expressions look for patterns and what you seem to want is quite a few different patterns. It might be more appropriate in your case to write a function which splits the string into characters and counts what you have, so you can check your criteria.
Just to give an example of your problem, a regex like /[abc]/ will match every single occurrence of a, b and c regardless of how many times those letters appear in the string. You can try /c{1,2}/ and it will match "c", "cc", and "ccc". It matches the last case because the unanchored pattern only needs to find a run of one or two c's somewhere inside "ccc".
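A small sketch of that last point, comparing the unanchored pattern with a version anchored by \A and \z (the constant names are only for illustration):

```ruby
# Unanchored: the pattern only has to match somewhere inside the string.
FIRST_MATCH = "ccc"[/c{1,2}/]
# Anchored with \A and \z: the whole string must be one or two c's.
WHOLE_MATCH = "ccc".match?(/\Ac{1,2}\z/)

p FIRST_MATCH   # => "cc"
p WHOLE_MATCH   # => false
```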
One thing I have found invaluable when developing and debugging regular expressions is rubular.com. Try some examples and I think you'll see what you're up against.
I don't know if this is really any help but it might help you choose a direction.

You need to break out your pattern properly. In regexp terms, [ABCC] means "any one of A, B or C" where the duplicate C is ignored. It's a set operator, not a grouping operator like () is.
What you seem to be describing is creating a regexp based on parameters. You can do this by passing a string to Regexp.new and using the result.
An example is roughly:
def match_for_options(options)
pattern = '^'
pattern << 'A' * options[:a] if (options[:a])
pattern << 'B' * options[:b] if (options[:b])
pattern << 'C' * options[:c] if (options[:c])
Regexp.new(pattern)
end
You'd use it something like this:
if (match_for_options(:a => 1, :c => 2).match('ACC'))
# ...
end

Since you want to allow these "elements" to appear in any order, you might be better off writing a bit of Ruby code that goes through the string from beginning to end and counts the number of As, Bs, and Cs, finds whether it contains your desired substring. If the number of As, Bs, and Cs, is in your desired limits, and it contains the desired substring, and its length (i.e. the number of characters) is equal to the length of the desired substring, plus # of As, plus # of Bs, plus # of Cs, plus at most N characters more than that, then the string is good, otherwise it is bad. Actually, to be careful, you should first search for your desired substring and then remove it from the original string, then count # of As, Bs, and Cs, because otherwise you may unintentionally count the As, Bs, and Cs that appear in your desired string, if there are any there.
You can do what you want with a regular expression, but it would be a long ugly regular expression. Why? Because you would need a separate "case" in the regular expression for each of the possible orders of the elements. For example, the regular expression "^ABC..XY$" will match any string beginning with "ABC" and ending with "XY" and having two wild card characters in the middle. But only in that order. If you want a regular expression for all possible orders, you'd need to list all of those orders in the regular expression, e.g. it would begin something like "^(ABC..XY|ACB..XY|BAC..XY|BCA..XY|" and go on from there, with about 5! = 120 different orders for that list of 5 elements, then you'd need more for the cases where there was no A, then more for cases where there was no B, etc. I think a regular expression is the wrong tool for the job here.
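The counting approach described above could be sketched roughly like this. The method name pattern_match?, its keyword arguments and the default limits are made up for illustration, and the sketch assumes that every character other than A, B or C (after removing the fixed word) counts as a wild card.

```ruby
# Hypothetical helper, not from the original posts.
# fixed:         the word that must appear as a contiguous substring
# limits:        maximum number of times each listed letter may appear
# max_wildcards: maximum number of other characters allowed
def pattern_match?(word, fixed: "XY", limits: { "A" => 1, "B" => 1, "C" => 2 }, max_wildcards: 1)
  idx = word.index(fixed)
  return false if idx.nil?

  # Remove one occurrence of the fixed word so its letters are not counted.
  rest = word[0...idx] + word[(idx + fixed.length)..-1]

  counts = Hash.new(0)
  rest.each_char { |ch| counts[ch] += 1 }

  # Each listed letter must stay within its limit.
  return false unless limits.all? { |letter, max| counts[letter] <= max }

  # Everything that is not a listed letter is treated as a wild card.
  wildcards = rest.length - limits.keys.sum { |letter| counts[letter] }
  wildcards <= max_wildcards
end

p pattern_match?("ABWXY")  # => true
p pattern_match?("CXYCB")  # => true
p pattern_match?("AABXY")  # => false
```

Which occurrence of the fixed word is removed does not matter here, because only the counts of the remaining characters are checked.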

Related

How to use regular expressions to match numbers with some exceptions

I need to match numbers in groups of 5, from 1 to 5, with the following exceptions:
Numbers can't include zeros
Numbers can't be like 11111, 22222 and so on.
Numbers can't be like 12345 or 54321
Some examples of valid numbers:
14252, 45121, 43412, 51321 ...
So far I got an expression to group the numbers and do not allow zeros.
/[1-5]{5}/
But I'm having trouble handling the second and third exceptions. I tried unsuccessfully to use a negative lookahead to disallow a match if I have a pattern of repeated numbers.
?!11111|?!22222
I'm trying with this expression:
((?!11111)[1-5]{5}?)
How can I write regular expressions to not match certain patterns?
I will eventually change it to not match any other sequence of numbers.
First off, you don't have to cram everything into one regex. Regexes are already complicated; if you can do it in multiple regexes, that will often make things much simpler and allow for more flexible code. For example, you can customize the error message based on which condition failed. Usually you only need to fold multiple regexes together for performance reasons, and there are tools to do that automatically.
So far I got an expression to group the numbers and do not allow zeros.
/[1-5]{5}/
Careful, you have to anchor at both ends, otherwise it will accept any string that contains a run of five digits from 1-5.
/\A[1-5]{5}\z/
Numbers can't be like 11111, 22222 and so on.
Use a capture within the regex to accomplish this. Capture the first number, then see if there's four more. () to capture and \1 to refer to what was captured.
/\A([1-5])\1{4}\z/
Numbers can't be like 12345 or 54321
/\A(?:12345|54321)\z/
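Put together, the three separate regexes could drive a validator that reports which rule failed. This is only a sketch: the method name validate_code and the message texts are invented for illustration.

```ruby
# Hypothetical validator: returns nil when the string is acceptable,
# otherwise a message naming the first rule that failed.
def validate_code(str)
  return "must be exactly five digits from 1 to 5" unless str =~ /\A[1-5]{5}\z/
  return "must not repeat a single digit" if str =~ /\A([1-5])\1{4}\z/
  return "must not be 12345 or 54321" if str =~ /\A(?:12345|54321)\z/
  nil
end

p validate_code("14252")  # => nil
p validate_code("12043")  # => "must be exactly five digits from 1 to 5"
p validate_code("22222")  # => "must not repeat a single digit"
p validate_code("54321")  # => "must not be 12345 or 54321"
```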
Here's a solution that does not use a regular expression. I understand we are to determine if: a) the string contains five characters; b) each character equals '1', '2', '3', '4' or '5'; c) the string contains at least two different characters; and d) the string is neither '12345' nor '54321'. We can do that as follows.
def is_ok?(str)
str.size == 5 && # five characters
(str.chars - ['1','2','3','4','5']).empty? && # only the digits '1'-'5'
str.squeeze.size > 1 && # not all the same character
str != '12345' && # not an increasing sequence
str != '54321' # not a decreasing sequence
end
is_ok? '12543' #=> true
is_ok? '12043' #=> false
is_ok? '12643' #=> false
is_ok? '22222' #=> false
is_ok? '12345' #=> false
is_ok? '54321' #=> false
You have the right idea using negative lookaheads, just the syntax was a little off. This works for me:
\A(?!11111|22222|33333|44444|55555|12345|54321)[1-5]{5}\z
How about this?
^(?!([1-5])\1{4})(?!54321)(?!12345)[1-5]{5}$
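One thing to keep in mind in Ruby: ^ and $ match at line boundaries, while \A and \z match at the start and end of the whole string. A minimal sketch of the difference, using the last pattern with both kinds of anchors (the constant names and the multi-line input are made up for illustration):

```ruby
LINE_ANCHORED   = /^(?!([1-5])\1{4})(?!54321)(?!12345)[1-5]{5}$/
STRING_ANCHORED = /\A(?!([1-5])\1{4})(?!54321)(?!12345)[1-5]{5}\z/

input = "junk\n14252"  # a valid number on the second line only

p LINE_ANCHORED.match?(input)      # => true
p STRING_ANCHORED.match?(input)    # => false
p STRING_ANCHORED.match?("14252")  # => true
```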

How does this gsub and regex work?

I'm trying to learn ruby and having a hard time figuring out what each individual part of this code is doing. Specifically, how does the global subbing determine whether two sequential numbers are both one of these values [13579] and how does it add a dash (-) in between them?
def DashInsert(num)
num_str = num.to_s
num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
num_str.gsub(/([13579])(?=[13579])/, '\1-')
() is called a capturing group; it captures the characters matched by the pattern inside it. Here that pattern is [13579], which matches a single digit from the given set. The matched digit is captured and stored in group index 1.
(?=[13579]) is a positive lookahead, which asserts that the match must be followed by a character matched by the pattern inside the lookahead. The replacement occurs only if this condition is satisfied.
\1 refers to the characters captured by group index 1.
Example:
> "13".gsub(/([13579])(?=[13579])/, '\1-')
=> "1-3"
You may start with some random tests:
def DashInsert(num)
num_str = num.to_s
num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
10.times{
x = rand(10000)
puts "%6i: %6s" % [x,DashInsert(x)]
}
Example:
9633: 963-3
7774: 7-7-74
6826: 6826
7386: 7-386
2145: 2145
7806: 7806
9499: 949-9
4117: 41-1-7
4920: 4920
14: 14
And now to check the regex.
([13579]) take any odd digit and remember it (it can be used later with \1)
(?=[13579]) Check if the next digit is also odd, but don't take it (it still remains in the string)
'\1-' Output the first odd digit and add a - to it.
In other words:
Put a - between every two adjacent odd digits.
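To see why the lookahead matters, compare it with a variant that consumes the second odd digit instead of only inspecting it. The constant names are just for this sketch.

```ruby
CONSUMING = /([13579])([13579])/    # second odd digit becomes part of the match
LOOKAHEAD = /([13579])(?=[13579])/  # second odd digit is only inspected

p "135".gsub(CONSUMING, '\1-\2')  # => "1-35"
p "135".gsub(LOOKAHEAD, '\1-')    # => "1-3-5"
```

With the consuming version the 3 is already used up by the first match, so it cannot start a new match with the 5.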

At which position does the regex fail?

I need a very simple string validator that shows the position of the first symbol that does not correspond to the desired format. I want to use a regex, but then I have to find the place where the string stops matching the expression, and I can't find a method that would do that.
(It's got to be a fairly simple method... maybe there isn't one?)
For example if I have regex:
/^Q+E+R+$/
with string:
"QQQQEEE2ER"
The desired result should be 7
An idea: what you can do is to tokenize your pattern and write it with optional nested capturing groups:
^(Q+(E+(R+($)?)?)?)?
Then you only need to count the number of capture groups you obtain to know where the regex engine stops in the pattern and you can determine the offset of the match end in the string with the whole match length.
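In Ruby, that idea might look like the following sketch. The constant PARTIAL and the helper partial_match_info are names invented here; the helper returns the number of capture groups that were set and the length of the overall match.

```ruby
# Nested optional groups: each level is only attempted if the previous one matched.
PARTIAL = /^(Q+(E+(R+($)?)?)?)?/

# Hypothetical helper: [number of groups set, length of the matched prefix].
def partial_match_info(str)
  m = PARTIAL.match(str)  # always matches, since every part is optional
  [m.captures.compact.size, m[0].length]
end

p partial_match_info("QQQQEEE2ER")  # => [2, 7]
p partial_match_info("QQQQEEEERR")  # => [4, 10]
```

Two groups set means the engine got through Q+ and E+ but not R+, and the match length 7 is the offset of the first offending character.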
As @zx81 notes in his comment, if one of the elements can match the next element (for example, Q can match the element E), things become different.
Let's say that Q is \w (and can match E and R). For the string QQQEEERRR the preceding pattern will give only one capturing group (the greedy \w+ matches everything), whereas ^(\w+)(E+)(R+)$ will give three groups: QQQEE, E, RRR
To obtain the same result you need to add an alternation:
^((?:\w+(?=E)|\w+)(E+(R+($)?)?)?)?
In the alternation, the case where E exists must be tested first, and only if this branch fails (with the lookahead), then the other branch where E doesn't exist is used.
Thus the full pattern can be rewritten like this to deal with this specific case:
^((?:Q+(?=E)|Q+)((?:E+(?=R)|E+)((?:R+(?=$)|R+)($)?)?)?)?
You could perhaps also take a look at the amatch gem.
This is an interesting task that can be accomplished with a neat regex trick:
^(?:(?=(Q+)))?(?:(?=(Q+E+)))?(?:(?=(Q+E+R+)))?(?:(?=(Q+E+R+$)))?
We have four optional lookaheads checking various parts of the pattern and capturing the partial matches to Groups 1, 2, 3 and 4 incrementally.
Group 1 contains Q+ if it can be matched, in your example QQQQ.
Group 2 contains Q+E+ if it can be matched, in your example QQQQEEE.
Group 3 contains Q+E+R+ if it can be matched, in your example nil.
Group 4 contains Q+E+R+$ if it can be matched, in your example nil.
In your code, check which is the last Group that is set by testing !$1.nil?, !$2.nil? and so on.
The last one set gives you the length that is matchable, so in your example $2.length gives you the 7 you wanted.
Incidentally, the fact that Group 2 is the last one set also tells you that we fail on R+.
For your example, you could do the following.
Code
Change your regex from:
/^Q+E+R+$/
to
R = /^(Q*)(E*)(R*)/
and then apply the following method to the string:
def nbr_matched_chars(str)
str.scan(R).flatten.reduce(0) {|t,e| return t if e.nil?; t+e.size }
end
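For a single-line str that contains at least one Q, one E and one R, str matches the original regex if and only if nbr_matched_chars(str) == str.size. (The relaxed regex uses * where the original uses +, so a string such as "QQ" is fully matched by R even though it fails the original.)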
Examples
nbr_matched_chars("QQQQEEE2ER") #=> 7
nbr_matched_chars("QQQQEEEERR") #=> 10 (= "QQQQEEEERR".size)
nbr_matched_chars("QQAQQEEEER") #=> 2
Explanation
To see why this [evidently :-)] works, we can look at the results of invoking String#scan, followed by Array#flatten. Because every group uses *, a group that finds nothing yields an empty string rather than nil:
"QQQQEEE2ER".scan(R).flatten #=> ["QQQQ", "EEE" , ""  ]
"QQQQEEEERR".scan(R).flatten #=> ["QQQQ", "EEEE", "RR"]
"QQAQQEEEER".scan(R).flatten #=> ["QQ"  , ""    , ""  ]

XPath: replace every other whitespace

I'd like to replace every other (odd?) space with x. The result should be:
axb axb axb axb axb
I tried something like:
replace ("a b a b a b a b" , " " , "x")[position() mod 2 = 0]
-- but with no result.
First of all: fn:replace requires an XPath 2.0 (or XQuery) compatible query processor.
You cannot use fn:replace with a predicate like this. There is no array-like access to characters in XPath (like you're used to from e.g. C). You could probably also solve this using fn:tokenize and a for-loop, but that gets rather complicated.
Your query did not return any result because replace returns exactly one item (a single-string sequence), and the predicate keeps only every second item.
Use a regular expression instead. This expression matches on non-space (\S) and space (\s) and replaces those patterns by a version with x in between. The star quantifier in the end is important for odd number of match groups (like in your example).
replace("a b a b a b a b" , "(\S+)\s+(\S+\s*)", "$1x$2")

Ruby Regular expressions (regex): character appear only once at most

Suppose I want to make sure a string x equals any combination of abcd (each character appearing one or zero times-->each character should not repeat, but the combination may appear in any order)
valid ex: bc .. abcd ... bcad ... b... d .. dc
invalid ex. abcdd, cc, bbbb, abcde (of course)
my effort:
I tried various techniques:
the closest I came was
x =~ ^(((a)?(b)?(c)?(d)?))$
but this won't work if I do not type them in the same order as I have written:
works for: ab, acd, abcd, a, d, c
won't work for: bcda, cb, da (anything that is not in the above order)
you can test your solutions here : http://rubular.com/r/wCpD355bub
PS: the characters may not be in alphabetical order, it could be u c e t
If you can use things besides regexes, you can try:
str.chars.uniq.length == str.length && str.match(/^[a-d]+$/)
The general idea here is that you just strip any duplicated characters from the string, and if the length of the uniq array is not equal to the length of the source string, you have a duplicated character in the string. The regex then enforces the character set.
This can probably be improved, but it's pretty straightforward. It does create a couple of extra arrays, so you might want a different approach if this needs to be used in a performance-critical location.
If you want to stick to regexes, you could use:
str.match(/^[a-d]+$/) && !str.match(/([a-d]).*\1/)
That'll basically check that the string only contains the allowed characters, and that those characters are never repeated.
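Both variants can be wrapped in small helpers and run against the examples from the question. The method names unique_by_uniq? and unique_by_regex? are made up for this sketch, and !! is only there to turn the MatchData into true or false.

```ruby
# Variant 1: compare the number of distinct characters with the string length.
def unique_by_uniq?(str)
  str.chars.uniq.length == str.length && !!str.match(/^[a-d]+$/)
end

# Variant 2: allowed characters only, and no character followed later by itself.
def unique_by_regex?(str)
  !!str.match(/^[a-d]+$/) && !str.match(/([a-d]).*\1/)
end

valid   = %w[bc abcd bcad b d dc]
invalid = %w[abcdd cc bbbb abcde]

p valid.all?   { |s| unique_by_uniq?(s) && unique_by_regex?(s) }  # => true
p invalid.none? { |s| unique_by_uniq?(s) || unique_by_regex?(s) } # => true
```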
This is really not what regular expressions are meant to do, but if you really really want to.
Here is a regex that satisfies the conditions.
^([a-d])(?!(\1))([a-d])?(?!(\1|\3))([a-d])?(?!(\1|\3|\5))([a-d])?(?!(\1|\3|\5|\7))$
Basically it goes through each character, capturing it in a group, then makes sure that the next character does not repeat that group. Then it checks the next character, and makes sure that neither that group nor any of the previous groups is repeated.
You can reverse it (match the condition that would make it fail)
re = /^ # start of line
(?=.*([a-d]).*\1) # match if a letter appears more than once
| # or
(?=.*[^a-d]) # match if a non abcd char appears
/x
puts 'fail' if %w{bc abcd bcad b d dc}.any?{|s| s =~ re}
puts 'fail' unless %w{abcdd cc bbbb abcde}.all?{|s| s =~ re}
I don't think regexes are well suited to this problem, so here is another non-regex solution. It's recursive:
def match_chars_no_more_than_once(characters, string)
return true if string.empty?
if characters.index(string[0])
match_chars_no_more_than_once(characters.sub(string[0],''), string[1..-1])
else
false
end
end
%w{bc bdac hello acbbd cdda}.each do |string|
p [string, match_chars_no_more_than_once('abcd', string)]
end
Output:
["bc", true]
["bdac", true]
["hello", false]
["acbbd", false]
["cdda", false]
