Ruby - convert single backslash unicode to double backslash - ruby

I have a string that looks something like '\\u001E\\u001Csome_text\u001F' - where the first two characters are escaped with two backslashes and the last one only has one backslash.
I want to convert that string so that all the unicode literals have two backslashes in them, so the output would look like '\\u001E\\u001Csome_text\\u001F'. What ways could I go about doing this?

If the goal is to find instances of '\u' not preceded by a '\' character, then a negative lookbehind should suit the need to match:
(?<!\\)\\u
If you want to match the whole character, then you can also use a positive lookahead to verify the following characters:
(?<!\\)\\u(?=[0-9A-Z]{4})
Both of these will return the instances of '\u', which you can replace with '\\u'

Related

Is there a way to match double quotes inside two double quotes?

I tried the following regex, but it matches all the double quotes:
(?>(?<=(")|))"(?(1)(?!"))
Here is a sample of the text:
"[\"my cars last night\",
\"Burger\",\"Decaf\" shirt\",
\"Mocha\",\"marshmallows\",
\"Coffee Mission\"]"
The pattern I want to match is the double quote between the double quotes in line 2
As a general rule, I would say: no.
Given a string:
\"Burger\" \"Decaf\" shirt\"
How do you decide which \" is superfluous (non-matching)? Is this one after Burger, one after Decaf or one after shirt? Or one before any of these words? I believe the choice is arbitrary.
Although in your particular example it seems that you want all \" that are not adjacent to comma.
These can be found by following regexp:
(?<!^)(?<![,\[])\\"(?![,\]])
We start with \\" (backslash followed by double quote) in the center.
Then we use negative lookahead to discard all matches that are followed by comma or closing square bracket.
Then we use negative lookbehind to discard all matches that happen after comma or opening bracket.
Regexp engine that I have used can't cope with alternation inside lookaround statements. To work around it, I take advantage of the fact that lookarounds are zero-length matches and I prepend negative lookbehind that matches beginning of line at the beginning of expression.
Proof (in perl):
$ cat test
"[\"my cars last night\",
\"Burger\",\"Decaf\" shirt\",
\"Mocha\",\"marshmallows\",
\"Coffee Mission\"]"
$ perl -n -e '$_ =~ s/(?<!^)(?<![,\[])\\"(?![,\]])/|||/g; print $_' test
"[\"my cars last night\",
\"Burger\",\"Decaf||| shirt\",
\"Mocha\",\"marshmallows\",
\"Coffee Mission\"]"
Let's assume that the format of your string must be like this:
["item1", "item2", ... "itemN"]
The way to know if a double quote is a closing double quote is to check if it is followed by a comma or a closing square bracket.
To find a double quote enclosed by double quotes, you must match all well formatted items from the beginning until an unexpected quote.
Example to find the first enclosed quote (if it exists):
(?:"[^"]*",\s*)*+"[^"]*\K"
demo
But this works only for one enclosed quote in all the string and isn't useful if you want to find all of them.
to find all quotes:
(?:\G(?!\A)|(?:\A[^"]*|[^"]*",\s*)(?:"[^"]*",\s*)*+")[^"]*\K"(?!\s*[\],])
demo

ruby regex not matching a string if "+" character in source

i have a string that contains an international phone number e.g. +44 9383 33333 but bizzarely when I attempt to regex match (note the regex is correctly 'escaped') the regex match fails
e.g.
"001144 9383 33333".match(/(001144|004|0|\\+44)/) # works
"+44 9383 33333".match(/(001144|004|0|\\+44)/) # DOES NOT WORK
i've tried escaping the input string e.g. +, \+ etc. etc but to no avail.
i must be doing something really stupid here!?
You've got a double backslash, which is telling the regex parser to look for a literal backslash. Since it's also followed by a + the regex parser is then looking for 1 or more backslashes. Try just \+ (so the whole thing should be: /(001144|004|0|\+44)/

split string by spaces properly accounting for quotes and backslashes (ruby)

I want to split a string (insecure foreign line, like exim_mainlog line) by spaces, but not by spaces that are inside of double quotes, and ignore if the quote is escaped by a backslash like \", and ignore the backslash if it is just escaped like \\. Without slow parsing the string manually with FSM.
Example line:
U=mailnull T="test \"quote\" and wild blackslash\\" P=esmtps
Should be split into:
["U=mailnull", "T=\"test \\\"quote\\\" and wild blackslash\\\"", "P=esmtps"]
(Btw, I think ruby should had method for such split.., sigh).
I think I found simple enough solution: input.scan(/(?:"(?:\\.|[^"])*"|[^" ])+/)

How to allow string with letters, numbers, period, hyphen, and underscore?

I am trying to make a regular expression, that allow to create string with the small and big letters + numbers - a-zA-z0-9 and also with the chars: .-_
How do I make such a regex?
The following regex should be what you are looking for (explanation below):
\A[-\w.]*\z
The following character class should match only the characters that you want to allow:
[-a-zA-z0-9_.]
You could shorten this to the following since \w is equivalent to [a-zA-z0-9_]:
[-\w.]
Note that to include a literal - in your character class, it needs to be first character because otherwise it will be interpreted as a range (for example [a-d] is equivalent to [abcd]). The other option is to escape it with a backslash.
Normally . means any character except newlines, and you would need to escape it to match a literal period, but this isn't necessary inside of character classes.
The \A and \z are anchors to the beginning and end of the string, otherwise you would match strings that contain any of the allowed characters, instead of strings that contain only the allowed characters.
The * means zero or more characters, if you want it to require one or more characters change the * to a +.
/\A[\w\-\.]+\z/
\w means alphanumeric (case-insensitive) and "_"
\- means dash
\. means period
\A means beginning (even "stronger" than ^)
\z means end (even "stronger" than $)
for example:
>> 'a-zA-z0-9._' =~ /\A[\w\-\.]+\z/
=> 0 # this means a match
UPDATED thanks phrogz for improvement

ruby regex about escape a escape

I am trying to write a regex in Ruby to test a string such as:
"GET \"anything/here.txt\""
the point is, everything can be in the outer double quote, but all double quotes in the outer double quotes must be escaped by back slash(otherwise it doesnt match). So for example
"GET "anything/here.txt""
this will not be a proper line.
I tried many ways to write the regex but doest work. can anyone help me with this? thank you
You can use positive lookbehind:
/\A"((?<=\\)"|[^"])*"\z/
This does exactly what you asked for: "if a double quote appears inside the outer double quotes without a backslash prefixed, it doesn't match."
Some comments:
\A,\z: These match only at the beginning and end of the string. So the pattern has to match against the whole string, not a part of it.
(?<=): This is the syntax for positive lookbehind; it asserts that a pattern must match directly before the current position. So (?<=\\)" matches "a double quote which is preceded by a backslash".
[^"]: This matches "any character which is not a backslash".
One point about this regex, is that it will match an inner double quote which is preceded by two backslashes. If that is a problem, post a comment and I'll fix it.
If your version of Ruby doesn't have lookbehind, you could do something like:
/\A"(\\.|[^"\\])*"\z/
Note that unlike the first regexp, this one does not count a double backslash as escaping a quote (rather, the first backslash escapes the second one), so "\\"" will not match.
This works:
/"(?<method>[A-Z]*)\s*\\\"(?<file>[^\\"]*)\\""/
See it on Rubular.
Edit:
"(?<method>[A-Z]*)\s(?<content>(\\\"|[a-z\/\.]*)*)"
See it here.
Edit 2: without (? ...) sequence (for Ruby 1.8.6):
"([A-Z]*)\s((\\\"|[a-z\/\.]*)*)"
Rubular here.
Tested this on Rubular successfully:
\"GET \\\".*\\\"\"
Breakdown:
\" - Escape the " for the regex string, meaning the literal character "
GET - Assuming you just want GET than this is explicit
\\" - Escape \ and " to get the literal string \"
.* - 0 or more of any character other than \n
\\"\" - Escapes for the literal \""
I'm not sure a regex is really your best tool here, but if you insist on using one, I recommend thinking of the string as a sequence of tokens: a quote, then a series of things that are either \\, \" or anything that isn't a quote, then a closing quote at the end. So this:
^"(\\\\|\\"|[^"])*"$

Resources