String#scan not capturing all occurrences - ruby

I'm seeing strange behaviour in the return value of Ruby's String#scan method. I have the code below and I can't work out why "scan" doesn't return 2 elements.
str = "10011011001"
regexp = "0110"
p str.scan(/(#{regexp})/)
==> [["0110"]]
String "str" clearly contains 2 occurrences of pattern "0110".
I want to fetch all the occurrences of my regexp in str, of course.

The reason is that after finding the first result, the regex engine continues its walk at the position after this first result. So the zero at the end of the first result can't be reused for another result.
The way to get overlapping results is to put your pattern in a capture group inside a lookahead (a lookahead is only a zero-width assertion, a test, and doesn't consume any characters). This way the regex engine always advances one character at a time and can test every position in the string, even when something is captured in the group:
(?=(yourpattern))
Then your result is in the capture group 1
With your example:
p str.scan(/(?=(0110))/)
[["0110"], ["0110"]]
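A minimal sketch of the lookahead approach, wrapped in two helpers. The names overlapping_matches and overlapping_offsets are made up for illustration, and Regexp.escape is used on the assumption that the pattern should be matched literally. The second helper does not use a regex at all; it steps through the string with String#index to report where each occurrence starts:

```ruby
# Hypothetical helpers, not part of the original answers.

# Returns every (possibly overlapping) occurrence of a literal substring.
def overlapping_matches(str, substring)
  # The lookahead consumes nothing, so scan retries at every position.
  str.scan(/(?=(#{Regexp.escape(substring)}))/).flatten
end

# Returns the start offset of every (possibly overlapping) occurrence.
def overlapping_offsets(str, substring)
  offsets = []
  pos = 0
  while (found = str.index(substring, pos))
    offsets << found
    pos = found + 1 # advance one character only, so overlaps are kept
  end
  offsets
end

p overlapping_matches("10011011001", "0110") # => ["0110", "0110"]
p overlapping_offsets("10011011001", "0110") # => [2, 5]
```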

str = "10011011001"
match = "0110"
str.chars.each_cons(match.size).map(&:join).select { |cons| cons == match }
Should do it.

Related

Ruby regex avoid matching a group

I have this code running inside a buffer (used to unescape a JS string in Ruby):
elsif hex_substring =~ /^\\u[0-9a-fA-F]{1,4}/
hex_substring.scan(/^((\\u[\da-fA-F]{4}){1,})/) do |match|
hex_byte = match[0]
buffer << JSON.load(%Q("#{hex_byte}"))
hex_index += hex_byte.length
end
...
I have a concern that the scan() is matching a bit too much:
hex_substring.scan(/^((\\u[\da-fA-F]{4}){1,})/)
# => [["\\ud83c\\udfec", "\\udfec"]]
I am using only "\\ud83c\\udfec", not "\\udfec".
Is there a way in Ruby or in regex to grab only the first part?
You should use a single grouping construct here, the one to match 1 or more occurrences of four hex chars, and omit the inner capturing group that resulted in an extra item in the resulting array:
.scan(/^(?:\\u[\da-fA-F]{4})+/)
Note that + is a simpler and shorter way to write {1,} (one or more occurrences).
Details
^ - start of string
(?: - start of a non-capturing group (what it matches won't be added to the final scan result):
\\u - a \u substring
[\da-fA-F]{4} - four hex chars
)+ - 1 or more occurrences (of the group pattern sequence).
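To make the difference concrete, here is a small sketch that runs both patterns on a sample input. The sample string and the constant names are assumptions made for illustration only:

```ruby
# Sample input (assumed): single quotes keep the backslashes literal.
HEX_SUBSTRING = '\ud83c\udfec and more'

# Two capturing groups: scan yields one sub-array per match, one entry per group.
WITH_GROUPS = HEX_SUBSTRING.scan(/^((\\u[\da-fA-F]{4}){1,})/)

# No capturing groups: scan yields the whole matched text directly.
WITHOUT_GROUPS = HEX_SUBSTRING.scan(/^(?:\\u[\da-fA-F]{4})+/)

p WITH_GROUPS    # => [["\\ud83c\\udfec", "\\udfec"]]
p WITHOUT_GROUPS # => ["\\ud83c\\udfec"]
```

Note that when the pattern has no capturing groups, the block form of scan receives a String rather than an Array, so a loop like the one in the question would use match itself instead of match[0].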

best way to find substring in ruby using regular expression

I have a string https://stackoverflow.com. I want a new string that contains the domain from the given string using regular expressions.
Example:
x = "https://stackoverflow.com"
newstring = "stackoverflow.com"
Example 2:
x = "https://www.stackoverflow.com"
newstring = "www.stackoverflow.com"
"https://stackoverflow.com"[/(?<=:\/\/).*/]
#⇒ "stackoverflow.com"
(?<=..) is a positive lookbehind.
If string = "http://stackoverflow.com",
a really easy way is string.split("http://")[1]. But this isn't regex.
A regex solution would be as follows:
string.scan(/^http:\/\/(.+)$/).flatten.first
To explain:
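String#scan returns an array of every match of the regex. When the regex contains a capture group, each entry is itself an array of the captured groups.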
The regex:
^ matches beginning of line
http: matches those characters
\/\/ matches //
(.+) sets a "match group" containing one or more of any character. This is the value returned by the scan.
$ matches end of line
.flatten.first extracts the results from String#scan, which in this case returns a nested array.
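As a rough sketch, the nested array that scan returns can be inspected directly, and the same capture can be fetched in one step with String#[] and a group index. The helper name strip_scheme and the choice to accept both http and https are assumptions for illustration:

```ruby
# Hypothetical helper: returns everything after "http://" or "https://",
# or nil when the string does not start with either scheme.
def strip_scheme(url)
  url[%r{\Ahttps?://(.+)\z}, 1]
end

p "http://stackoverflow.com".scan(/^http:\/\/(.+)$/) # => [["stackoverflow.com"]]
p strip_scheme("https://www.stackoverflow.com")      # => "www.stackoverflow.com"
p strip_scheme("stackoverflow.com")                  # => nil
```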
You might want to try this:
#!/usr/bin/env ruby
str = "https://stackoverflow.com"
if mtch = str.match(/(?::\/\/)(\S+)/)
f1 = mtch.captures
end
There are two groups in the pattern: the first is a non-capturing group that matches the :// separator, and the second is a capturing group that matches everything after it up to the next whitespace. The captures method then returns the captured text as an array, so f1 is ["stackoverflow.com"].
I hope this solves your problem.

How does this gsub and regex work?

I'm trying to learn ruby and having a hard time figuring out what each individual part of this code is doing. Specifically, how does the global subbing determine whether two sequential numbers are both one of these values [13579] and how does it add a dash (-) in between them?
def DashInsert(num)
num_str = num.to_s
num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
num_str.gsub(/([13579])(?=[13579])/, '\1-')
() is a capturing group, which captures the characters matched by the pattern inside it. Here that pattern is [13579], which matches a single digit from the given set. The matched digit is captured and stored as group 1.
(?=[13579]) is a positive lookahead, which asserts that the match must be followed by a character matched by the pattern inside the lookahead, without consuming it. The replacement occurs only if this condition is satisfied.
\1 refers to the characters captured by group 1.
Example:
> "13".gsub(/([13579])(?=[13579])/, '\1-')
=> "1-3"
You may start with some random tests:
def DashInsert(num)
num_str = num.to_s
num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
10.times{
x = rand(10000)
puts "%6i: %6s" % [x,DashInsert(x)]
}
Example:
9633: 963-3
7774: 7-7-74
6826: 6826
7386: 7-386
2145: 2145
7806: 7806
9499: 949-9
4117: 41-1-7
4920: 4920
14: 14
And now to check the regex.
([13579]) Take any odd digit and remember it (it can be used later with \1).
(?=[13579]) Check whether the next digit is also odd, but don't take it (it still remains in the string).
'\1-' Output the first odd digit and add a - to it.
In other words:
Puts a - between every two adjacent odd digits.
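To see why the lookahead matters, here is a sketch that compares it with a variant that consumes both digits. The method is renamed to snake_case here, and dash_insert_consuming is a hypothetical counterexample rather than anything from the question:

```ruby
# Lookahead version from the question (renamed to snake_case).
def dash_insert(num)
  num.to_s.gsub(/([13579])(?=[13579])/, '\1-')
end

# Hypothetical variant that consumes both digits instead of looking ahead.
def dash_insert_consuming(num)
  num.to_s.gsub(/([13579])([13579])/, '\1-\2')
end

p dash_insert(7774)           # => "7-7-74"
p dash_insert_consuming(7774) # => "7-774"
```

The consuming variant misses the second pair because the middle 7 has already been used up by the first match, so it cannot start a new one.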

At which position does the regex fail?

I need a very simple string validator that shows the position of the first symbol that does not correspond to the desired format. I want to use a regex, but then I have to find the place where the string stops matching the expression, and I can't find a method that does that.
(It's got to be a fairly simple method... maybe there isn't one?)
For example if I have regex:
/^Q+E+R+$/
with string:
"QQQQEEE2ER"
The desired result should be 7
An idea: what you can do is to tokenize your pattern and write it with optional nested capturing groups:
^(Q+(E+(R+($)?)?)?)?
Then you only need to count the capture groups that were set to know where the regex engine stopped in the pattern, and the length of the whole match gives you the offset in the string where matching stopped.
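A minimal sketch of this idea, assuming the pattern above is used unchanged. The constant NESTED and the method match_progress are names invented for illustration:

```ruby
# Nested optional groups, as in the pattern above.
NESTED = /^(Q+(E+(R+($)?)?)?)?/

# Hypothetical helper: reports how far the match got in the string
# and how many of the nested groups were set.
def match_progress(str)
  m = NESTED.match(str) # every group is optional, so this always matches
  { matched_length: m[0].length, groups_set: m.captures.compact.size }
end

progress = match_progress("QQQQEEE2ER")
puts progress[:matched_length] # 7: matching stopped at offset 7
puts progress[:groups_set]     # 2: the Q+ and E+ groups were set, R+ was not
```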
As @zx81 notes in his comment, things become different if one of the elements can also match the next element (for example, if Q can match what E matches).
Let's say that Q is \w (and can match E and R). For the string QQQEEERRR the previous pattern will give only one capturing group (the greedy \w+ matches everything), whereas ^(\w+)(E+)(R+)$ will give three groups: QQQEE, E, RRR
To obtain the same result you need to add an alternation:
^((?:\w+(?=E)|\w+)(E+(R+($)?)?)?)?
In the alternation, the case where E exists must be tested first, and only if this branch fails (with the lookahead), then the other branch where E doesn't exist is used.
Thus the full pattern can be rewritten like this to deal with this specific case:
^((?:Q+(?=E)|Q+)((?:E+(?=R)|E+)((?:R+(?=$)|R+)($)?)?)?)?
You could perhaps take a look at the amatch gem too.
This is an interesting task that can be accomplished with a neat regex trick:
^(?:(?=(Q+)))?(?:(?=(Q+E+)))?(?:(?=(Q+E+R+)))?(?:(?=(Q+E+R+$)))?
We have four optional lookaheads checking various parts of the pattern and capturing the partial matches to Groups 1, 2, 3 and 4 incrementally.
Group 1 contains Q+ if it can be matched, in your example QQQQ.
Group 2 contains Q+E+ if it can be matched, in your example QQQQEEE.
Group 3 contains Q+E+R+ if it can be matched, in your example nil.
Group 4 contains Q+E+R+$ if it can be matched, in your example nil.
In your code, check which is the last Group that is set by testing !$1.nil?, !$2.nil? and so on.
The last one set gives you the length that is matchable, so in your example $2.length gives you the 7 you wanted.
Incidentally, the fact that Group 2 is the last one set also tells you that we fail on R+.
For your example, you could do the following.
Code
Change your regex from:
/^Q+E+R+$/
to
R = /^(Q*)(E*)(R*)/
and then apply the following method to the string:
def nbr_matched_chars(str)
str.scan(R).flatten.reduce(0) {|t,e| return t if e.nil?; t+e.size }
end
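If str matches the original regex, then nbr_matched_chars(str) == str.size. The converse also needs each of the three groups to be non-empty, because Q*, E* and R* accept empty runs (for example, "QQRR" yields 4 although it has no E).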
Examples
nbr_matched_chars("QQQQEEE2ER") #=> 7
nbr_matched_chars("QQQQEEEERR") #=> 10 (= "QQQQEEEERR".size)
nbr_matched_chars("QQAQQEEEER") #=> 2
Explanation
To see why this [evidently :-)] works, we can look at the results of invoking String#scan, followed by Array#flatten. Because each group uses *, a group that finds nothing captures an empty string rather than nil:
"QQQQEEE2ER".scan(R).flatten #=> ["QQQQ", "EEE" , ""  ]
"QQQQEEEERR".scan(R).flatten #=> ["QQQQ", "EEEE", "RR"]
"QQAQQEEEER".scan(R).flatten #=> ["QQ"  , ""    , ""  ]

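Another way to find where a string stops matching, sketched here with StringScanner from Ruby's standard library instead of a single regex. This is a different technique from the answers above; the method name first_mismatch_offset and the idea of splitting the pattern into a list of parts are assumptions for illustration:

```ruby
require 'strscan'

# Hypothetical helper: walks the string one required part at a time and
# returns the character offset where matching stops, or nil when the
# whole string matches every part in order.
def first_mismatch_offset(str, parts = [/Q+/, /E+/, /R+/])
  scanner = StringScanner.new(str)
  parts.each do |part|
    # StringScanner#scan only matches at the current position.
    return scanner.charpos unless scanner.scan(part)
  end
  scanner.eos? ? nil : scanner.charpos
end

p first_mismatch_offset("QQQQEEE2ER") # => 7
p first_mismatch_offset("QQQQEEEERR") # => nil
p first_mismatch_offset("QQAQQEEEER") # => 2
```

Because each part must match at least one character, a string such as "QQRR" is reported as failing at offset 2 rather than being accepted.
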
Using Regexp to check whether a string starts with a consonant

Is there a better way to write the following regular expression in Ruby? The first regex matches a string that begins with a (lower case) consonant, the second with a vowel.
I'm trying to figure out if there's a way to write a regular expression that matches the negative of the second expression, versus writing the first expression with several ranges.
string =~ /\A[b-df-hj-np-tv-z]/
string =~ /\A[aeiou]/
The statement
string =~ /\A[^aeiou]/
will test whether the string starts with a non-vowel character, which includes digits, punctuation, whitespace and control characters. That is fine if you know beforehand that the string begins with a letter, but to check that it starts with a consonant you can use two look-aheads to test that it starts with both a letter and a non-vowel, like this
string =~ /\A(?=[^aeiou])(?=[a-z])/i
To match an arbitrary number of consonants, you can use the sub-expression (?i:(?![aeiou])[a-z]) to match a single consonant. It is a self-contained group, so you can put a repetition count like {3} right after it. For example, this program finds all the strings in a list that begin with three consonants in a row
list = %w/ aab bybt xeix axei AAsE SAEE eAAs xxsa Xxsr /
puts list.select { |word| word =~ /\A(?i:(?![aeiou])[a-z]){3}/ }
output
bybt
xxsa
Xxsr
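The repetition count can also be made a parameter. The following is a small sketch along those lines; the method name starts_with_consonants? is made up for illustration and Regexp#match? assumes Ruby 2.4 or later:

```ruby
# Hypothetical helper: true when word begins with at least `count`
# consecutive ASCII consonants (case insensitive).
def starts_with_consonants?(word, count)
  # (?![aeiou]) rejects vowels, [a-z] then requires a letter.
  word.match?(/\A(?i:(?![aeiou])[a-z]){#{count}}/)
end

list = %w/ aab bybt xeix axei AAsE SAEE eAAs xxsa Xxsr /
p list.select { |word| starts_with_consonants?(word, 3) }
# => ["bybt", "xxsa", "Xxsr"]
```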
I modified the answer provided by @Alexander Cherednichenko in order to get rid of the if statements.
/^[^aeiou\W]/i.match(s) != nil
If you want to catch a string that doesn't start with vowels, but only starts with consonants you can use this code below. It returns true if a string starts with any letter other than A, E, I, O, U. s is any string we give to a function
if /^[^aeiou\W]/i.match(s) == nil
return false
else
return true
end
The i flag at the end makes the regular expression case insensitive.
\W inside the negated class excludes any non-word character (punctuation or whitespace, for example). Digits and the underscore are word characters, so a string such as "1something" still matches.
[^aeiou] means any character except a e i o u
And we put ^ at the beginning, before [, to indicate that the following class [^aeiou\W] is for the 1st character
Note that the ^[^aeiou\W] pattern is not correct because it also matches a line that starts with a digit or an underscore. Borodin's solution works well, but there is one more possible solution without lookaheads, based on character class intersection (the && operator) and using the more contemporary Regexp#match?:
/\A[a-z&&[^aeiou]]/i.match?(word)
Details
\A - start of a string (^ in Ruby is start of any line)
[a-z&&[^aeiou]] - an a-z character range matching any ASCII letter (/i flag makes it case insensitive) except for the aeiou chars.
See the Ruby demo:
test = %w/ 1word _word ball area programming /
puts test.select { |w| /\A[a-z&&[^aeiou]]/i.match?(w) }
# => ['ball', 'programming']
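To compare the two patterns discussed above side by side, here is a short sketch. The constant names are assumptions for illustration, and the inputs reuse the test words from the demo:

```ruby
# Pattern from the earlier answer: negated class that also excludes \W.
NEGATED_CLASS = /^[^aeiou\W]/i

# Pattern based on character class intersection.
INTERSECTION = /\A[a-z&&[^aeiou]]/i

%w/ 1word _word ball area /.each do |w|
  puts "#{w}: negated=#{NEGATED_CLASS.match?(w)} intersection=#{INTERSECTION.match?(w)}"
end
# 1word: negated=true intersection=false
# _word: negated=true intersection=false
# ball: negated=true intersection=true
# area: negated=false intersection=false
```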

Resources