Is there a way to match double quotes inside two double quotes? - ruby

I tried the following regex, but it matches all the double quotes:
(?>(?<=(")|))"(?(1)(?!"))
Here is a sample of the text:
"[\"my cars last night\",
\"Burger\",\"Decaf\" shirt\",
\"Mocha\",\"marshmallows\",
\"Coffee Mission\"]"
The pattern I want to match is the double quote between the double quotes in line 2

As a general rule, I would say: no.
Given a string:
\"Burger\" \"Decaf\" shirt\"
How do you decide which \" is superfluous (non-matching)? Is this one after Burger, one after Decaf or one after shirt? Or one before any of these words? I believe the choice is arbitrary.
Although in your particular example it seems that you want all \" that are not adjacent to comma.
These can be found by following regexp:
(?<!^)(?<![,\[])\\"(?![,\]])
We start with \\" (backslash followed by double quote) in the center.
Then we use negative lookahead to discard all matches that are followed by comma or closing square bracket.
Then we use negative lookbehind to discard all matches that happen after comma or opening bracket.
Regexp engine that I have used can't cope with alternation inside lookaround statements. To work around it, I take advantage of the fact that lookarounds are zero-length matches and I prepend negative lookbehind that matches beginning of line at the beginning of expression.
Proof (in perl):
$ cat test
"[\"my cars last night\",
\"Burger\",\"Decaf\" shirt\",
\"Mocha\",\"marshmallows\",
\"Coffee Mission\"]"
$ perl -n -e '$_ =~ s/(?<!^)(?<![,\[])\\"(?![,\]])/|||/g; print $_' test
"[\"my cars last night\",
\"Burger\",\"Decaf||| shirt\",
\"Mocha\",\"marshmallows\",
\"Coffee Mission\"]"

Let's assume that the format of your string must be like this:
["item1", "item2", ... "itemN"]
The way to know if a double quote is a closing double quote is to check if it is followed by a comma or a closing square bracket.
To find a double quote enclosed by double quotes, you must match all well formatted items from the beginning until an unexpected quote.
Example to find the first enclosed quote (if it exists):
(?:"[^"]*",\s*)*+"[^"]*\K"
demo
But this works only for one enclosed quote in all the string and isn't useful if you want to find all of them.
to find all quotes:
(?:\G(?!\A)|(?:\A[^"]*|[^"]*",\s*)(?:"[^"]*",\s*)*+")[^"]*\K"(?!\s*[\],])
demo

Related

Ruby - convert single backslash unicode to double backslash

I have a string that looks something like '\\u001E\\u001Csome_text\u001F' - where the first two characters are escaped with two backslashes and the last one only has one backslash.
I want to convert that string so that all the unicode literals have two backslashes in them, so the output would look like '\\u001E\\u001Csome_text\\u001F'. What ways could I go about doing this?
If the goal is to find instances of '\u' not preceded by a '\' character, then a negative lookbehind should suit the need to match:
(?<!\\)\\u
If you want to match the whole character, then you can also use a positive lookahead to verify the following characters:
(?<!\\)\\u(?=[0-9A-Z]{4})
Both of these will return the instances of '\u', which you can replace with '\\u'

ruby gsub new line characters

I have a string with newline characters that I want to gsub out for white space.
"hello I\r\nam a test\r\n\r\nstring".gsub(/[\\r\\n]/, ' ')
something like this ^ only my regex seems to be replacing the 'r' and 'n' letters as well. the other constraint is sometimes the pattern repeats itself twice and thus would be replaced with two whitespaces in a row, although this is not preferable it is better than all the text being cut apart.
If there is a way to only select the new line characters. Or even better if there a more rubiestic way of approaching this outside of going to regex?
If you have mixed consecutive line breaks that you want to replace with a single space, you may use the following regex solution:
s.gsub(/\R+/, ' ')
See the Ruby demo.
The \R matches any type of line break and + matches one or more occurrences of the quantified subpattern.
Note that in case you have to deal with an older version of Ruby, you will need to use the negated character class [\r\n] that matches either \r or \n:
.gsub(/[\r\n]+/, ' ')
or - add all possible linebreaks:
/gsub(/(?:\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])+/, ' ')
This should work for your test case:
"hello I\r\nam a test\r\n\r\nstring".gsub(/[\r\n]/, ' ')
If you don't want successive \r\n characters to result in duplicate spaces you can use this instead:
"hello I\r\nam a test\r\n\r\nstring".gsub(/[\r\n]+/, ' ')
(Note the addition of the + after the character class.)
As Wiktor mentioned, you're using \\ in your regex, which inside the regex literal /.../ actually escapes a backslash, meaning you're matching a literal backslash \, r, or n as part of your expression. Escaping characters works differently in regex literals, since \ is used so much, it makes no sense to have a special escape for it (as opposed to regular strings, which is a whole different animal).

split string by spaces properly accounting for quotes and backslashes (ruby)

I want to split a string (insecure foreign line, like exim_mainlog line) by spaces, but not by spaces that are inside of double quotes, and ignore if the quote is escaped by a backslash like \", and ignore the backslash if it is just escaped like \\. Without slow parsing the string manually with FSM.
Example line:
U=mailnull T="test \"quote\" and wild blackslash\\" P=esmtps
Should be split into:
["U=mailnull", "T=\"test \\\"quote\\\" and wild blackslash\\\"", "P=esmtps"]
(Btw, I think ruby should had method for such split.., sigh).
I think I found simple enough solution: input.scan(/(?:"(?:\\.|[^"])*"|[^" ])+/)

ruby regex about escape a escape

I am trying to write a regex in Ruby to test a string such as:
"GET \"anything/here.txt\""
the point is, everything can be in the outer double quote, but all double quotes in the outer double quotes must be escaped by back slash(otherwise it doesnt match). So for example
"GET "anything/here.txt""
this will not be a proper line.
I tried many ways to write the regex but doest work. can anyone help me with this? thank you
You can use positive lookbehind:
/\A"((?<=\\)"|[^"])*"\z/
This does exactly what you asked for: "if a double quote appears inside the outer double quotes without a backslash prefixed, it doesn't match."
Some comments:
\A,\z: These match only at the beginning and end of the string. So the pattern has to match against the whole string, not a part of it.
(?<=): This is the syntax for positive lookbehind; it asserts that a pattern must match directly before the current position. So (?<=\\)" matches "a double quote which is preceded by a backslash".
[^"]: This matches "any character which is not a backslash".
One point about this regex, is that it will match an inner double quote which is preceded by two backslashes. If that is a problem, post a comment and I'll fix it.
If your version of Ruby doesn't have lookbehind, you could do something like:
/\A"(\\.|[^"\\])*"\z/
Note that unlike the first regexp, this one does not count a double backslash as escaping a quote (rather, the first backslash escapes the second one), so "\\"" will not match.
This works:
/"(?<method>[A-Z]*)\s*\\\"(?<file>[^\\"]*)\\""/
See it on Rubular.
Edit:
"(?<method>[A-Z]*)\s(?<content>(\\\"|[a-z\/\.]*)*)"
See it here.
Edit 2: without (? ...) sequence (for Ruby 1.8.6):
"([A-Z]*)\s((\\\"|[a-z\/\.]*)*)"
Rubular here.
Tested this on Rubular successfully:
\"GET \\\".*\\\"\"
Breakdown:
\" - Escape the " for the regex string, meaning the literal character "
GET - Assuming you just want GET than this is explicit
\\" - Escape \ and " to get the literal string \"
.* - 0 or more of any character other than \n
\\"\" - Escapes for the literal \""
I'm not sure a regex is really your best tool here, but if you insist on using one, I recommend thinking of the string as a sequence of tokens: a quote, then a series of things that are either \\, \" or anything that isn't a quote, then a closing quote at the end. So this:
^"(\\\\|\\"|[^"])*"$

count quotes in a string that do not have a backslash before them

Hey I'm trying to use a regex to count the number of quotes in a string that are not preceded by a backslash..
for example the following string:
"\"Some text
"\"Some \"text
The code I have was previously using String#count('"')
obviously this is not good enough
When I count the quotes on both these examples I need the result only to be 1
I have been searching here for similar questions and ive tried using lookbehinds but cannot get them to work in ruby.
I have tried the following regexs on Rubular from this previous question
/[^\\]"/
^"((?<!\\)[^"]+)"
^"([^"]|(?<!\)\\")"
None of them give me the results im after
Maybe a regex is not the way to do that. Maybe a programatic approach is the solution
How about string.count('"') - string.count("\\"")?
result = subject.scan(
/(?: # match either
^ # start-of-string\/line
| # or
\G # the position where the previous match ended
| # or
[^\\] # one non-backslash character
) # then
(\\\\)* # match an even number of backslashes (0 is even, too)
" # match a quote/x)
gives you an array of all quote characters (possibly with a preceding non-quote character) except unescaped ones.
The \G anchor is needed to match successive quotes, and the (\\\\)* makes sure that backslashes are only counted as escaping characters if they occur in odd numbers before the quote (to take Amarghosh's correct caveat into account).

Resources