count quotes in a string that do not have a backslash before them - ruby

Hey, I'm trying to use a regex to count the number of quotes in a string that are not preceded by a backslash.
For example, take the following strings:
"\"Some text
"\"Some \"text
The code I have was previously using String#count('"'),
which is obviously not good enough.
When I count the quotes in both of these examples, I need the result to be just 1.
I have been searching here for similar questions and I've tried using lookbehinds, but I cannot get them to work in Ruby.
I have tried the following regexes on Rubular from this previous question:
/[^\\]"/
^"((?<!\\)[^"]+)"
^"([^"]|(?<!\)\\")"
None of them give me the results I'm after.
Maybe a regex is not the way to do that. Maybe a programmatic approach is the solution.

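How about string.count('"') - string.scan('\\"').size? (String#count treats its argument as a set of characters, so it cannot count the two-character sequence \" directly; scan with a literal string can.)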

result = subject.scan(
  /(?:        # match either
    ^         # start-of-string\/line
  |           # or
    \G        # the position where the previous match ended
  |           # or
    [^\\]     # one non-backslash character
  )           # then
  (?:\\\\)*   # match an even number of backslashes (0 is even, too)
  "           # match a quote
  /x)
gives you an array of all quote characters (each possibly with a preceding character and escaped backslashes) except escaped ones, so result.size is the count you are after.
The \G anchor is needed to match successive quotes, and the (?:\\\\)* makes sure that backslashes are only counted as escaping characters if they occur in odd numbers before the quote (to take Amarghosh's correct caveat into account). The group is non-capturing because scan returns only the captured groups when the pattern contains any.
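The question also asks whether a programmatic approach might be simpler. A minimal sketch of one, assuming a backslash always escapes exactly the next character (count_unescaped_quotes is a hypothetical helper name, not something from the thread):

```ruby
# Hypothetical helper: walks the string once and counts double quotes
# that are not escaped by a preceding backslash.
# Assumption: a backslash escapes exactly the one character after it,
# so \\" is an escaped backslash followed by a real quote.
def count_unescaped_quotes(str)
  count = 0
  escaped = false
  str.each_char do |ch|
    if escaped
      escaped = false          # this character was escaped; skip it
    elsif ch == '\\'
      escaped = true           # the next character is escaped
    elsif ch == '"'
      count += 1               # a quote with no active escape
    end
  end
  count
end

# Single-quoted literals keep the backslash, so these are the
# two strings from the question.
count_unescaped_quotes('"\"Some text')    # => 1
count_unescaped_quotes('"\"Some \"text')  # => 1
```

This avoids lookbehinds entirely and handles runs of backslashes without a regex.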

Related

Why 'scan' reads multiple lines

My test configuration file (test_config.conf) looks like this:
[DEFAULT]
system_name=
#test
flag=true
I want to read this and scan the value for key "system_name", with the expected output nil. I could have used config parser to read the contents, but using scan is my requirement.
I did:
File.read
Scan: file_data.scan(/^#{each}\s*=\s*(?!.*#)\s*(.*)/)
Regex: ^system_name\s*=\s*(?!.*#)\s*(.*)$
I used (?!.*#) to ignore the values that start with #.
It returns #test. Could someone help me understand why it does so, and how I can change my regex to make it work as expected?
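This is another case of backtracking confusing regex users. The (?!.*#) negative lookahead fails the match if there is a # anywhere in the rest of the current line, while (?!#) fails it only if a # follows immediately. Note also that \s matches line breaks, which is why the match can run on to the next line. Since the preceding part of the pattern can match the string in several ways, the regex engine retries the quantified subpatterns once the lookahead fails. In your case, \s* matches 0 or more whitespace characters: the engine first consumes all the whitespace after = (including the line break), finds # ahead, and fails. It then backtracks and tries matching zero whitespace characters, finds no # between = and the end of that line, and succeeds. The later \s* then consumes the line break and (.*) captures #test.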
Use a possessive quantifier with \s*+ to disallow backtracking:
^system_name\s*=\s*+(?!#)(.*)$
                   ^
See the Rubular demo. So, the lookahead will only be run once, after all the 0+ whitespace characters are matched. If it fails, the whole match fails right away.
Another way is to use [^\s#] negated character class:
^system_name\s*=\s*([^\s#].*)$
                   ^^^^^^^
See another Rubular demo
Here, [^\s#] will only match a char that is neither whitespace nor #, and then .* will match any 0+ chars other than line break chars.
As per the feedback in the comments, the structure of the input may be rather loose, and a key=value pair can follow the system_name line. In that case, you also need to make sure the text you capture does not actually start with some word chars followed by an = sign:
/^system_name\s*=\s*+(?!#|\w+=)(.*)$/
See this Rubular demo
Full pattern details:
^ - start of a line
system_name - a literal substring
\s* - 0 or more whitespaces
= - an equal sign
\s*+ - 0 or more whitespaces with no backtracking into the pattern due to *+ possessive quantifier
(?!#|\w+=) - a negative lookahead that fails the match if the # or 1+ word chars and then = are found immediately to the right of the current location (that is right after the 0+ whitespaces)
(.*) - Group 1: any 0+ chars up to the end of the line
$ - end of a line.
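The difference between the two patterns can be checked directly in Ruby. This is a small sketch using the sample file contents from the question as an inline string; the constant names are made up for illustration:

```ruby
# Sample config from the question, inlined as a string (assumption:
# the real code reads the same text with File.read).
CONFIG = "[DEFAULT]\nsystem_name=\n#test\nflag=true\n"

# Original pattern: \s* can give back the line break, so the lookahead
# is retried at the end of the system_name line and succeeds there.
BACKTRACKING = /^system_name\s*=\s*(?!.*#)\s*(.*)$/

# Possessive \s*+ never gives characters back, so the lookahead is
# checked exactly once, right after all the whitespace.
POSSESSIVE = /^system_name\s*=\s*+(?!#|\w+=)(.*)$/

CONFIG.scan(BACKTRACKING)               # => [["#test"]]
CONFIG.scan(POSSESSIVE)                 # => []
"system_name=abc\n".scan(POSSESSIVE)    # => [["abc"]]
```

Note that scan returns an empty array rather than nil when nothing matches, so the calling code would need to treat [] as "no value".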

Regex - How can I remove specific characters between strings/delimiters?

This is related to cleaning files before parsing them elsewhere, namely, malformed/ugly CSV. I see plenty of examples for removing/matching all characters between certain strings/characters/delimiters, but I cannot find any for specific strings. Example portion of line would look something like:
","Should now be allowed by rule above "Server - Access" added by Rich"\r
To be clear, this is not the entire line, but the entire line is enclosed in quotes and separated by "," and ends in ^M (Windows newline/carriage return). The 'columns' preceding this would be enclosed at each side by ",". I would probably use this too to remove cruft that appears earlier in the line.
What I am trying to get to is the removal of all double quotes between "," and "\r ("Server - Access" - these ones) without removing the delimiters. Alternatively, I may just find and replace them with \" to delimit them for the Ruby CSV library. So far I have this:
(?<=",").*?(?="\\r)
Which basically matches everything between the delimiters. If I replace .*? with anything, be that a letter, double quotes etc, I get zero matches. What am I doing wrong?
Note: This should be Ruby compatible please.
If I understand you correctly, you can use negative lookahead and lookbehind:
text = '","Should now be allowed by rule above "Server - Access" added by Rich"\r'
puts text.gsub(/(?<!,)"(?![,\\r])/, '\"')
# ","Should now be allowed by rule above \"Server - Access\" added by Rich"\r
Of course, this won't work if the values themselves can contain commas and newlines...

grep wildcards issue ubuntu

I have an input file named test which looks like this
leonid sergeevich vinogradov
ilya alexandrovich svintsov
and when I use grep like this grep 'leonid*vinogradov' test it says nothing, but when I type grep 'leonid.*vinogradov' test it gives me the first string. What's the difference between * and .*? Because I see no difference between any number of any characters and any character followed by any number of any characters.
I use ubuntu 14.04.3.
* doesn't match any number of characters, like in a file glob. It is an operator, which indicates 0 or more matches of the previous character. The regular expression leonid*vinogradov would require a v to appear immediately after 0 or more ds. The . is the regular expression metacharacter representing any single character, so .* matches 0 or more arbitrary characters.
grep uses regex, and .* matches 0 or more of any character.
Whereas 'leonid*vinogradov' is also evaluated as a regex, and it means leoni followed by 0 or more of the letter d, hence your match fails.
grep uses regular expressions (regexp for short), not the wildcards you had in mind. In this case, "." means any character and "*" means any number (including zero) of the previous character, so ".*" means anything here.
Check the link, or google it; it's a powerful tool that you'll find worth learning.
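The same distinction can be reproduced outside grep. The sketch below uses Ruby, whose regex engine treats * and . the same way for these two patterns, and File.fnmatch to show the glob behaviour the question was expecting; the constant names are illustrative only:

```ruby
LINE = 'leonid sergeevich vinogradov'

# Regex: d* means "zero or more d characters", so a v must come
# directly after the d's.
STAR_ONLY = /leonid*vinogradov/
# Regex: .* means "zero or more of any character", which bridges
# the middle name.
DOT_STAR = /leonid.*vinogradov/

LINE.match?(STAR_ONLY)                  # => false
LINE.match?(DOT_STAR)                   # => true
'leonidddvinogradov'.match?(STAR_ONLY)  # => true
'leonivinogradov'.match?(STAR_ONLY)     # => true (zero d's)

# Glob: here * alone does mean "any run of characters".
File.fnmatch('leonid*vinogradov', LINE) # => true
```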

Ruby regex too greedy with back to back matches

I'm working on some text processing in Ruby 1.8.7 to support some custom shortcodes that I've created. Here are some examples of my shortcode:
[CODE first-part]
[CODE first-part second-part]
I'm using the following RegEx to grab the
text.gsub!( /\[CODE (\S+)\s?(\S?)\]/i, replacementText )
The problem is this: the regex doesn't work on the following text:
[CODE first-part][CODE first-part-again]
The results are as follows:
1. first-part][CODE
2. first-part-again
It seems that the \s? is the problematic part of the regex that is searching on until it hits the last space, not the first one. When I change the regex to the following:
/\[CODE ([\w-]+)\s?(\S*)\]/i
It works fine. The only concern I have is what \w matches compared with \S, as I want to make sure that \w will match URL-safe characters.
I'm sure there's a perfectly valid explanation, but it's eluding me. Any ideas? Thanks!
Actually, thinking about it, just using [^\]] might not be enough, as it will swallow up all spaces as well. You also need to exclude those:
/\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i
Note the [ ] - I just think it makes literal spaces more readable.
Working demo.
Explained in free-spacing mode:
\[CODE[ ] # match your identifier
( # capturing group 1
[^\]\s]+ # match one or more non-], non-whitespace characters
) # end of group 1
\s? # match an optional whitespace character
( # capturing group 2
[^\]\s]* # match zero or more non-], non-whitespace characters
) # end of group 2
\] # match the closing ]
As none of the character classes in the pattern includes ], you can never possibly go beyond the end of the square bracketed expression.
By the way, if you find unnecessary escapes in regex as obscuring as I do, here is the minimal version:
/\[CODE[ ]([^]\s]+)\s?([^]\s]*)]/i
But that is definitely a matter of taste.
The problem was with the greedy \S+ in this
/\[CODE (\S+)\s?(\S?)\]/i
You could try:
/\[CODE (\S+?)\s?(\S?)\]/i
but actually your new character class is IMO superior.
Even better might be:
/\[CODE ([^\]]+?)\s?([^\]]*)\]/i
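To see the two behaviours side by side, here is a short sketch that runs a greedy \S+ pattern (with \S* for the second part, as in the asker's revised regex) and the negated-character-class pattern over back-to-back shortcodes. The constant names and the sample text are assumptions for illustration:

```ruby
TEXT = '[CODE first-part][CODE first-part-again] [CODE a b]'

# Greedy \S+ happily runs across "][" because ] and [ are non-whitespace.
GREEDY = /\[CODE (\S+)\s?(\S*)\]/i

# Excluding ] and whitespace keeps each group inside its own brackets.
BOUNDED = /\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i

TEXT.scan(GREEDY).first
# => ["first-part][CODE", "first-part-again"]

TEXT.scan(BOUNDED)
# => [["first-part", ""], ["first-part-again", ""], ["a", "b"]]
```

With the bounded version, a missing second part shows up as an empty string in group 2 rather than swallowing the next shortcode.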

Ruby RegEx problem text.gsub[^\W-], '') fails

I'm trying to learn RegEx in Ruby, based on what I'm reading in "The Rails Way". But, even this simple example has me stumped. I can't tell if it is a typo or not:
text.gsub(/\s/, "-").gsub([^\W-], '').downcase
It seems to me that this would replace all spaces with -, then anywhere a string starts with a non letter or number followed by a dash, replace that with ''. But, using irb, it fails first on ^:
syntax error, unexpected '^', expecting ']'
If I take out the ^, it fails again on the W.
>> text = "I love spaces"
=> "I love spaces"
>> text.gsub(/\s/, "-").gsub(/[^\W-]/, '').downcase
=> "--"
Missing //
Although this makes a little more sense :-)
>> text.gsub(/\s/, "-").gsub(/([^\W-])/, '\1').downcase
=> "i-love-spaces"
And this is probably what is meant
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
\W means "a non-word character"
\w means "a word character"
The // generates a Regexp object
/[^\W-]/.class
=> Regexp
Step 1: Add this to your bookmarks. Whenever I need to look up regexes, it's my first stop
Step 2: Let's walk through your code
text.gsub(/\s/, "-")
You're calling the gsub function, and giving it 2 parameters.
The first parameter is /\s/, which is Ruby for "create a new regexp containing \s" (the // are like special "" for regexes).
The second parameter is the string "-".
This will therefore replace all whitespace characters with hyphens. So far, so good.
.gsub([^\W-], '').downcase
Next you call gsub again, passing it 2 parameters.
The first parameter is [^\W-]. Because we didn't enclose it in forward slashes, Ruby will literally try to run that code. [] creates an array, then it tries to put ^\W- into the array, which is not valid code, so it breaks.
Changing it to /[^\W-]/ gives us a valid regex.
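Looking at the regex, the [] says "match any character in this group", and the leading ^ negates the group. The group contains \W (which means non-word character) and -, so the negated class matches any character that is neither a non-word character nor a hyphen, in other words any word character.
As the second thing you pass to gsub is an empty string, it ends up replacing all the word characters with an empty string (thereby stripping them out and leaving only the hyphens), which is why the book's version almost certainly meant the lowercase \w.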
.downcase
Which just converts the string to lower case.
Hope this helps :-)
You forgot the slashes. It should be /[^\W-]/
Well, .gsub(/[^\W-]/,'') says replace anything that's neither a non-word character nor a - (that is, every word character) with nothing.
You probably want
>> text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
=> "i-love-spaces"
Lower case \w (\W is just the opposite)
The slashes are to say that the thing between them is a regular expression, much like quotes say the thing between them is a string.
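Putting the corrected version together, a minimal sketch of the slug-style helper the book example seems to be aiming for (slugify is a hypothetical name, and the lowercase \w is the assumed intent):

```ruby
# Hypothetical helper: spaces become hyphens, then everything that is
# neither a word character nor a hyphen is stripped, then lowercased.
def slugify(text)
  text.gsub(/\s/, "-").gsub(/[^\w-]/, '').downcase
end

slugify("I love spaces")   # => "i-love-spaces"
slugify("Hello, World!")   # => "hello-world"

# For contrast, the uppercase \W version strips the word characters
# themselves and leaves only the hyphens.
"I love spaces".gsub(/\s/, "-").gsub(/[^\W-]/, '')  # => "--"
```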
