Why 'scan' reads multiple lines - ruby

My test configuration file(test_config.conf) looks as below
[DEFAULT]
system_name=
#test
flag=true
I want to read this and scan the value for key "system_name", with the expected output nil. I could have used config parser to read the contents, but using scan is my requirement.
I did:
File.read
Scan: file_data.scan(/^#{each}\s*=\s*(?!.*#)\s*(.*)/)
Regex: ^system_name\s*=\s*(?!.*#)\s*(.*)$
I used (?!.*#) to ignore the values that start with #.
It returns #test. Could someone help me understand why it does so, and how I can change my regex to make it work as expected?

It is another case of how backtracking confuses regex users. (?!.*#) negative lookahead must match a location that is not immediately followed with #. Since the preceding pattern part can match the string in various ways, once failed, the regex engine retries the quantified subpatterns. So, in your case, \s* matches 0 or more whitespaces. Once the regex engine matched all the whitespaces after =, it finds # - and fails. Then backtracks: tries to match zero whitespaces. And finds out that there is no # after =. And succeeds.
Use a possessive quantifier with \s*+ to disallow backtracking:
^system_name\s*=\s*+(?!#)(.*)$
^
See the Rubular demo. So, the lookahead will only be run once after all the 0+ whitespaces are matched. If it fails to match, the whole match will be failed right away.
Another way is to use [^\s#] negated character class:
^system_name\s*=\s*([^\s#].*)$
^^^^^^^
See another Rubular demo
Here, [^\s#] will only match a char that is not a whitespace, nor #, and then .* will match any 0+ chars other than line break chars.
As per the feedback inside comments, the structure of the input may be rather loose, and a key=value can follow the system_name line. In that case, you also need to make sure the text you capture does not actually start with some word chars followed with = sign:
/^system_name\s*=\s*+(?!#|\w+=)(.*)$/
See this Rubular demo
Full pattern details:
^ - start of a line
system_name - a literal substring
\s* - 0 or more whitespaces
= - an equal sign
\s*+ - 0 or more whitespaces with no backtracking into the pattern due to *+ possessive quantifier
(?!#|\w+=) - a negative lookahead that fails the match if the # or 1+ word chars and then = are found immediately to the right of the current location (that is right after the 0+ whitespaces)
(.*) - Group 1: any 0+ chars up to the end of the line
$ - end of a line.

Related

Match strings with line breaks between two characters

How can I match line breaks between two asterisks in Ruby? I have this string
Foo **bar**
test **hello
world** 12345
and I only want to find
**hello
world**
I tried it with \*{1,2}(.*)\n(.*)\*{1,2} but this matches
**bar**
test **
I played with a non greedy matcher like \*{1,2}(.*?)\n(.*?)\*{1,2} but this doesn't work either, so I hope someone can help.
You may use
/\*{1,2}\b([^*]*)\R([^*]*)\b\*{1,2}/
See the Rubular demo
Details
\*{1,2} - 1 or 2 asterisks
\b - a word boundary, the next char must be a word char
([^*]*) - Group 1: any 0+ chars other than *
\R - a line break sequence
([^*]*) - Group 2: any 0+ chars other than *
\b - a word boundary, the preceding char must be a word char
\*{1,2} - 1 or 2 asterisks
Wiktor already gave a good answer. Here is another way of doing it:
(?<=\*{1,2})([^*])*(?=\*{1,2})
Tested here
NOTE: This will not work in Ruby, but it can work in some other languages
From the link in this answer:
Subexp of look-behind must be fixed-width.

Ruby Regexp character class with new line, why not match?

I want to use this regex to match any block comment (c-style) in a string.
But why the below does not?
rblockcmt = Regexp.new "/\\*[.\s]*?\\*/" # match block comment
p rblockcmt=~"/* 22/Nov - add fee update */"
==> nil
And in addition to what Sir Swoveland posted, a . matches any character except a newline:
The following metacharacters also behave like character classes:
/./ - Any character except a newline.
https://ruby-doc.org/core-2.3.0/Regexp.html
If you need . to match a newline, you can specify the m flag, e.g. /.*?/m
Options
The end delimiter for a regexp can be followed by one or more
single-letter options which control how the pattern can match.
/pat/i - Ignore case
/pat/m - Treat a newline as a character matched by .
...
https://ruby-doc.org/core-2.3.0/Regexp.html
Because having exceptions/quirks like newline not matching a . can be painful, some people specify the m option for every regex they write.
It appears that you intend [.\s]*? to match any character or a whitespace, zero or more times, lazily. Firstly, whitespaces are characters, so you don't need \s. That simplifies your expression to [.]*?. Secondly, if your intent is to match any character there is no need for a character class, just write .. Thirdly, and most importantly, a period within a character class is simply the character ".".
You want .*? (or [^*]*).

repeating regex to match mathematical symbol then number fails

I am trying to match mathematica expressions like 1+2 and 1*2/3.... to infinity. Can someone explain why my regex matches the final case below, and how to fix it so that it matches only valid expressions (that might stretch forever)?
perms=["12+2*4","2+2","-2+","12+34-"]
perms.each do |line|
puts "#{line}=#{eval(line)}" if line =~ /^\d+([+-\/*]\d+){1,}/
end
I expected the output to be:
12+2*4=20
2+2=4
Inside a [character set], the - character defines a range of characters -- think of [a-z] or [0-9]. If you want to match a literal -, it must be the first or last character.
/^\d+(?:[+\/*-]\d+)+$/
Other things: {1,} is exactly +; and you need to anchor at the end too, so you don't match 1+2+
You should finalize your expression with $ to match the entire input string:
/^\d+([-+\/*]\d+){1,}$/
The wrong position of the hyphen - is one source of error in your expression. The missing $ the other.

Ruby regex too greedy with back to back matches

I'm working on some text processing in Ruby 1.8.7 to support some custom shortcodes that I've created. Here are some examples of my shortcode:
[CODE first-part]
[CODE first-part second-part]
I'm using the following RegEx to grab the
text.gsub!( /\[CODE (\S+)\s?(\S?)\]/i, replacementText )
The problem is this: the regex doesn't work on the following text:
[CODE first-part][CODE first-part-again]
The results are as follows:
1. first-part][CODE
2. first-part-again
It seems that the \s? is the problematic part of the regex that is searching on until it hits the last space, not the first one. When I change the regex to the following:
\[CODE ([\w-]+)\s?(\S*)\]/i
It works fine. The only concern I have is what all \w vs \s as I want to make sure the \w will match URL-safe characters.
I'm sure there's a perfectly valid explanation, but it's eluding me. Any ideas? Thanks!
Actually, thinking about it, just using [^\]] might not be enough, as it will swallow up all spaces as well. You also need to exclude those:
/\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i
Note the [ ] - I just think it makes literal spaces more readable.
Working demo.
Explained in free-spacing mode:
\[CODE[ ] # match your identifier
( # capturing group 1
[^\]\s]+ # match one or more non-], non-whitespace characters
) # end of group 1
\s? # match an optional whitespace character
( # capturing group 2
[^\]\s]+ # match zero or more non-], non-whitespace characters
) # end of group 2
\] # match the closing ]
As none of the character classes in the pattern includes ], you can never possibly go beyond the end of the square bracketed expression.
By the way, if you find unnecessary escapes in regex as obscuring as I do, here is the minimal version:
/\[CODE[ ]([^]\s]+)\s?([^]\s]*)]/i
But that is definitely a matter of taste.
The problem was with the greedy \S+ in this
/\[CODE (\S+)\s?(\S?)\]/i
You could try:
/\[CODE (\S+?)\s?(\S?)\]/i
but actually your new character class is IMO superiror.
Even better might be:
/\[CODE ([^\]]+?)\s?([^\]]*)\]/i

count quotes in a string that do not have a backslash before them

Hey I'm trying to use a regex to count the number of quotes in a string that are not preceded by a backslash..
for example the following string:
"\"Some text
"\"Some \"text
The code I have was previously using String#count('"')
obviously this is not good enough
When I count the quotes on both these examples I need the result only to be 1
I have been searching here for similar questions and ive tried using lookbehinds but cannot get them to work in ruby.
I have tried the following regexs on Rubular from this previous question
/[^\\]"/
^"((?<!\\)[^"]+)"
^"([^"]|(?<!\)\\")"
None of them give me the results im after
Maybe a regex is not the way to do that. Maybe a programatic approach is the solution
How about string.count('"') - string.count("\\"")?
result = subject.scan(
/(?: # match either
^ # start-of-string\/line
| # or
\G # the position where the previous match ended
| # or
[^\\] # one non-backslash character
) # then
(\\\\)* # match an even number of backslashes (0 is even, too)
" # match a quote/x)
gives you an array of all quote characters (possibly with a preceding non-quote character) except unescaped ones.
The \G anchor is needed to match successive quotes, and the (\\\\)* makes sure that backslashes are only counted as escaping characters if they occur in odd numbers before the quote (to take Amarghosh's correct caveat into account).

Resources