Double escape characters in elisp regex patterns - elisp

(regexp-opt '("this" "that"))
returns,
"\\(?:th\\(?:at\\|is\\)\\)"
Why are there double backslashes everywhere in this elisp regex? Doesn't elisp regex syntax use a single backslash?
Also, ? is a postfix operator in regex patterns, which means it acts on the expression that precedes it (http://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Special.html#Regexp-Special). But here there is no expression before the ? operator. So what does the
(?:th\\
part mean in this regex?

The backslash is part of the regexp syntax. But to preserve it as part of a regexp string, you need to protect it with another backslash, as documented in the syntax for strings documentation:
'Likewise, you can include a backslash by preceding it with another backslash, like this: "this \\ is a single embedded backslash".'
As for the ?: construct, it's how you specify a non-capturing or "shy" group:
"A shy group serves the first two purposes of an ordinary group (controlling the nesting of other operators), but it does not get a number, so you cannot refer back to its value with ‘\digit’. Shy groups are particularly useful for mechanically-constructed regular expressions, because they can be added automatically without altering the numbering of ordinary, non-shy groups."
It's documented as part of the regexp backslash documentation. As the passage quoted above explains, it's useful in functions like regexp-opt for grouping patterns without creating capture groups.

Related

Escape characters in bash & expect script [duplicate]

I am using Tcl_StringCaseMatch function in C++ code for string pattern matching. Everything works fine until input pattern or string has [] bracket. For example, like:
str1 = pq[0]
pattern = pq[*]
Tcl_StringCaseMatch does not work here, i.e. it returns false for the inputs above.
How can I stop [] from being treated specially in pattern matching?
The problem is that [] are special characters in pattern matching. You need to escape them with a backslash to have them treated like plain characters:
pattern= "pq\\[*\\]"
I don't think the string itself needs this. The reason for the doubled backslash is that you want to pass the backslash itself to the Tcl engine.
For the casual reader:
[] has a special meaning in Tcl in general, beyond the pattern-matching role it takes here: "run command" (like `` or $() in shells). But [number] has no such effect, and the brackets are treated normally, so the string str1 does not need escaping here.
For extra confusion:
Tcl will interpret ] with no preceding [ as a normal character by default. I feel that's getting too confusing, and would rather have Tcl complain about unbalanced brackets. As the OP mentions though, this allows you to drop the final two backslashes and use "pq\\[*]". I dislike this, and would rather make it obvious that both are treated as plain characters and not in the usual Tcl way, but to each his/her own.

Lookahead containing the same token as left/right anchors

Got a variation of the classic "regex quoted strings" problem. I need to pick out strings that look like this:
"foo bar bar"
from a long string like this
token token "maybe quoted token that can also contain spaces"
Each of the tokens can be quoted or unquoted (this is easy to take care of using alternating groups), but sometimes I have quoted strings which have literal quotes inside them (not escaped in any way), the only usable thing being that those quotes never have spaces on either side (since that would create a delimiter). Those tokens look like this: "foo-bar"baz"
My initial thought was /"(?:[^"]|" )*"/ but that doesn't seem to work because a token like this: "here is some"quotes" gets split in two.
How should I do this? Platform is Ruby 2.1
Use this:
"(?:[^"]|"\w)+"
or
"(?:[^"]|"\S)+"
You can play with sample strings in the regex demo.
Explanation
" matches the opening quote
The non-capturing group (?:[^"]|"\w) matches...
One [^"] non-quote character, OR |
One quote and a word character "\w
+ one or more times
" closing quote
Further Refinements
If you want to allow quotes in other contexts, for instance escaped quotes, just add them to the alternation:
"(?:\\"|[^"]|"\w)+"
To allow quotes to be followed not just by a word char but any non-space:
"(?:\\"|[^"]|"\S)+"
This one may also suit your needs:
".*?"(?!\S)
To match also non-quoted tokens:
".*?"(?!\S)|\S+
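As a quick check in Ruby (the asker's platform), here is a minimal sketch that applies two of the patterns above. The sample line and the constant names are made up for illustration.

```ruby
# Quoted token whose inner quotes are always followed by a non-space character.
QUOTED = /"(?:[^"]|"\S)+"/
# Quoted or unquoted tokens: a closing quote must not be followed by a non-space.
TOKEN = /".*?"(?!\S)|\S+/

line = 'foo "here is some"quotes" bar'

quoted = line[QUOTED]
tokens = line.scan(TOKEN)

puts quoted          # "here is some"quotes"
puts tokens.inspect  # ["foo", "\"here is some\"quotes\"", "bar"]
```

Both patterns rely on the same observation from the question: a quote that is part of the token is never followed by a space.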

How to check how many variables (masks) declared in Regexp (ruby)?

Let's say I have a regexp with some arbitrary amount of capturing groups:
pattern = /(some)| ..a lot of masks combined.... |(other)/
Is there any way to determine a number of that groups?
If you can always find a string that matches the regex you are given, then it suffices to match it against the regex and look at the match data length. However, determining whether a regexp has a string that it matches is NP-hard[1]. This is only feasible if you know in advance what kind of regexes you'll be getting.
The next best method in the Regexp class is Regexp#source or Regexp#to_s. However, we need to parse the regex if we do this.
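For the easy case, where a matching string is known, a minimal sketch could look like this (the pattern and the input 'some' are made-up examples):

```ruby
pattern = /(some)|(?:shy)|(other)/

# Any string the pattern matches will do. Groups that did not take part
# in the match still show up in the captures, as nil.
match = pattern.match('some')
group_count = match.captures.size

puts match.captures.inspect  # ["some", nil]
puts group_count             # 2
```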
I can't speak for the future, but as of Ruby 2.0, there is no better method in the Regexp core class.
A left parenthesis denotes a literal left parenthesis, if preceded by an unescaped backslash. A backslash is unescaped unless an unescaped backslash precedes. So, a character is escaped iff preceded by an odd number of backslashes.
An unescaped left parenthesis denotes a capturing group iff it is not followed by a question mark. With a question mark, it can mean various things: (?'name') and (?<name>) denote a named capturing group. Named and unnamed capturing groups cannot coexist in the same regex, however[2]. (?:) denotes a non-capturing group. This is a special case of (?flags-flags:). (?>) denotes an atomic group. (?=), (?!), (?<=) and (?<!) denote lookaround. (?#) denotes a comment.
Ruby's regexp engine supports comments in regexes. Handling them in the main regex would be very difficult. We can try to strip them if we really want to support them, but supporting them fully gets messy, because inline flags can turn extended mode (and thus line comments) on and off in ways that a regular expression cannot capture. I will go ahead and not support unescaped parentheses in regex comments[3].
We want to count:
the number of left parentheses \(
that are not escaped by a backslash (?<!(?<!\\)(?:\\\\)*\\) (read: not preceded by an odd number of backslashes that are not preceded by yet another backslash) and
that are not followed by a question mark (?!\?)
Ruby doesn't support unbounded lookbehind, but if we reverse the source first, the first assertion becomes a lookahead: (?!\\(?:\\\\)*(?!\\)) (read: not followed by an odd number of backslashes). The second assertion becomes a lookbehind: (?<!\?).
The whole solution:
def count_groups(regexp)
  # named capture support:
  # named_count = regexp.named_captures.count
  # return named_count if named_count > 0
  # main: scan the reversed source so the backslash check can be a lookahead
  test = /(?<!\?)\((?!\\(?:\\\\)*(?!\\))/
  regexp.source.reverse.scan(test).count
end
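A usage sketch of the counting idea, written so that it scans the reversed source. The sample patterns are made up, and the sketch assumes the regex has no parentheses inside comments or character classes (a "(" inside [...] would be miscounted).

```ruby
# Counts unnamed capturing groups by scanning the reversed source, so that
# "preceded by backslashes" can be checked with a lookahead.
def count_groups(regexp)
  # "(" not preceded (in the reversed text) by "?" and not followed by an
  # odd number of backslashes.
  test = /(?<!\?)\((?!\\(?:\\\\)*(?!\\))/
  regexp.source.reverse.scan(test).count
end

puts count_groups(/(some)|(?:shy)|(other)/)  # 2
puts count_groups(/\((a)\\(b)/)              # 2
puts count_groups(/(?=x)(y)(?<!z)/)          # 1
```

In the second pattern, \( is a literal parenthesis and \\ is a literal backslash, so only (a) and (b) are counted.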
[1]: we can show the NP-hardness by converting the satisfiability problem to it:
AND: xy (x must be an assertion)
OR: x|y
NOT: (?!x)
atoms: (?=1), (?=.1), (?=..1), ..., (?!1), (?!.1)...
example(XOR): /^(?:(?=1)(?!.1)|(?!1)(?=.1))..$/
This extends to NP-completeness for any class of regexes that can be tested in polynomial time. This includes any regex with no nested repetition (or repeated backreferences to repetition or recursion) and with bounded nesting depth of optional matches.
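To see the encoding at work, here is a small sketch that runs the XOR example over all four two-bit inputs (the variable names are made up):

```ruby
# Character 1 and character 2 of the input act as two boolean variables.
xor = /^(?:(?=1)(?!.1)|(?!1)(?=.1))..$/

matching = %w[00 01 10 11].select { |bits| bits =~ xor }

puts matching.inspect  # ["01", "10"]
```

The regex matches exactly the inputs where one bit is set and the other is not, which is the truth table of XOR.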
[2]: /((?<name>..)..)../.match('abcdef').to_a returns ['abcdef', 'ab'], indicating that unnamed capturing groups are ignored when named capturing groups are present. Tested in Ruby 1.9.3
[3]: Inline comments start with (?# and end with ). They cannot contain an unescaped right parenthesis, but they can contain an unescaped left parenthesis. These can be stripped easily (even though we have to sprinkle the "unescaped" regex everywhere) and are the lesser evil, but they're also less likely to contain an unescaped left parenthesis.
Line comments start with # and end with a newline. These are only treated as comments in the extended mode. Outside the extended mode, they match the literal # and newline. This is still easy, even if we have to consider escaping again. Determining if the regex has the extended flag set is not too difficult, but the flag modifier groups are a different beast entirely.
Even with Ruby's awesome recursive regexes, merely determining if a previously-open group modifying the extended mode is already closed would yield a very nasty regex (even if you replace one by one and don't have to skip comments, you have to account for escaping). It wouldn't be pretty (even with interpolation) and it wouldn't be fast.

Matching an unescaped balanced pair of delimiters

How can I match a balanced pair of delimiters not escaped by backslash (that is itself not escaped by a backslash) (without the need to consider nesting)? For example with backticks, I tried this, but the escaped backtick is not working as escaped.
regex = /(?!<\\)`(.*?)(?!<\\)`/
"hello `how\` are` you"
# => $1: "how\\"
# expected "how\\` are"
And the regex above does not consider a backslash that is escaped by a backslash and is in front of a backtick, but I would like to.
How does StackOverflow do this?
The purpose of this is not much complicated. I have documentation texts, which include the backtick notation for inline code just like StackOverflow, and I want to display that in an HTML file with the inline code decorated with some span material. There would be no nesting, but escaped backticks or escaped backslashes may appear anywhere.
Lookbehind is the first thing everyone thinks of for this kind of problem, but it's the wrong tool, even in flavors like .NET that support unrestricted lookbehinds. You can hack something up, but it's going to be ugly, even in .NET. Here's a better way:
`[^`\\]*(\\.[^`\\]*)*`
The first part starts from the opening delimiter and gobbles up anything that's not the delimiter or a backslash. If the next character is a backslash, it consumes that and the character following it, whatever it may be. It could be the delimiter character, another backslash, or anything else, it doesn't matter.
It repeats those steps as many times as necessary, and when neither [^`\\] nor \\. can match, the next character must be the closing delimiter. Or the end of the string, but I'm assuming the input is well formed. But if it's not well formed, this regex will fail very quickly. I mention that because of this other approach I see a lot:
`(?:[^`\\]+|\\.)*`
This works fine on well-formed input, but what happens if you remove the last backtick from your sample input?
"hello `how\` are you"
According to RegexBuddy, after encountering the first backtick, this regex performed 9,252 distinct operations (or steps) before it could give up and report failure; mine failed in ten steps.
EDIT To extract just the part inside the delimiters, wrap that part in a capturing group. You'll still have to remove the backslashes manually.
`([^`\\]*(?:\\.[^`\\]*)*)`
I also changed the other group to non-capturing, which I should have done from the start. I don't avoid capturing religiously, but if you are using them to capture stuff, any other groups you use should be non-capturing.
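In Ruby, a minimal sketch of that extraction plus the manual backslash removal might look like this. The span markup and the names are hypothetical, and no HTML escaping is attempted.

```ruby
INLINE_CODE = /`([^`\\]*(?:\\.[^`\\]*)*)`/

text = 'hello `how\` are` you'

# Group 1 is the raw content, backslashes included.
raw = text[INLINE_CODE, 1]

# Drop each escaping backslash, keeping the character it protected.
unescaped = raw.gsub(/\\(.)/, '\1')

# Hypothetical markup: wrap each inline-code segment in a span.
html = text.gsub(INLINE_CODE) do
  content = Regexp.last_match(1).gsub(/\\(.)/, '\1')
  %(<span class="code">#{content}</span>)
end

puts raw        # how\` are
puts unescaped  # how` are
puts html       # hello <span class="code">how` are</span> you
```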
EDIT I think I've been reading too much into the question. On StackOverflow, if you want to include literal backticks in an inline-code segment or a comment, you use three backticks as the delimiter, not just one. Since there's no need to escape backticks, you can ignore backslashes as well. Your regex could turn out to be as simple as this:
```(.*?)```
Dealing with the possibility of false delimiters, you use the same basic technique:
```([^`]*(?:`(?!``)[^`]*)*)```
Is this what you're after?
By the way, this answer doesn't contradict @nneonneo's comment above. This answer doesn't consider the context in which the match is taking place. Is it in the source code of a program or web page? If it is, did the match occur inside a comment or a string literal? How do I even know the first backtick I found wasn't escaped? Regexes don't know anything about the context in which they operate; that's what parsers are for.
If you don't need nesting, regexes can indeed be a proper tool. Lexers of programming languages, for instance, use regexes to tokenize strings, and strings usually allow their own delimiters as escaped content. Anything more complicated than that will probably need a full-blown parser though.
The "general formula" is to match an escaped character (\\.) or any character that's valid as content but doesn't need to be escaped ([^{list of invalid chars}]). A "naïve" solution would be joining them with or (|), but for a more efficient variant see @AlanMoore's answer.
The complete example is shown below, in two variants: the first assumes that backslashes should only be used for escaping inside the string, the second assumes that a backslash anywhere in the text escapes the next character.
`((?:\\.|[^`\\])*)`
(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`
Working examples here and here. However, as @nneonneo commented (and I endorsed), regexes are not meant to do a complete parse, so you'd better keep things simple if you want them to work out right. Do you want to find a token in the text, or do you want to delimit it already knowing where it starts? The answer to that question is important in deciding which strategy works best for your case.

Backslash + captured group within Ruby regular expression

How do I escape a backslash before a captured group?
Example:
"foo+bar".gsub(/(\+)/, '\\\1')
What I expect (and want):
foo\+bar
what I unfortunately get:
foo\\1bar
How do I escape here correctly?
As others have said, you need to escape everything in that string twice. So in your case the solution is to use '\\\\\1' or '\\\\\\1'. But since you asked why, I'll try to explain that part.
The reason is that the replacement sequence is parsed twice: once by Ruby and once by the underlying regular expression engine, for which \1 is its own escape sequence. (It's probably easier to understand with double-quoted strings, since single quotes introduce an ambiguity where '\\1' and '\1' are equivalent but '\' and '\\' are not.)
So for example, a simple replacement here with a captured group and a double quoted string would be:
"foo+bar".gsub(/(\+)/, "\\1") #=> "foo+bar"
This passes the string \1 to the regexp engine, which it understands as a reference to a capture group. In Ruby string literals, "\1" means something else entirely (ASCII character 1).
What we actually want in this case is for the regexp engine to receive \\\1. It also understands \ as an escape character, so \\1 is not sufficient and will simply evaluate to the literal output \1. So, we need \\\1 in the regexp engine, but to get to that point we need to also make it past Ruby's string literal parser.
To do that, we take our desired regexp input and double every backslash again to get through Ruby's string literal parser. \\\1 therefore requires "\\\\\\1". In the case of single quotes one slash can be omitted as \1 is not a valid escape sequence in single quotes and is treated literally.
Addendum
One of the reasons this problem is usually hidden is thanks to the use of /.+/ style regexp quotes, which Ruby treats in a special way to avoid the need to double escape everything. (Of course, this doesn't apply to gsub replacement strings.) But you can still see it in action if you use a string literal instead of a regexp literal in Regexp.new:
Regexp.new("\.").match("a") #=> #<MatchData "a">
Regexp.new("\\.").match("a") #=> nil
As you can see, we had to double-escape the . for it to be understood as a literal . by the regexp engine, since "." and "\." both evaluate to . in double-quoted strings, but we need the engine itself to receive \..
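Coming back to the replacement string, a small sketch can make the two parsing layers visible. The variable names are made up, and the block form at the end is an alternative not mentioned above: gsub inserts a block's return value as-is, so only the string literal's own escaping applies there.

```ruby
single = '\\\\\1'   # five backslashes in single quotes
double = "\\\\\\1"  # six backslashes in double quotes

# Both literals produce the same four characters: \ \ \ 1
puts single            # \\\1
puts single == double  # true
puts single.length     # 4

escaped = "foo+bar".gsub(/(\+)/, single)
puts escaped           # foo\+bar

# Block form: no backreference parsing, so one escaped backslash is enough.
via_block = "foo+bar".gsub(/\+/) { |m| "\\#{m}" }
puts via_block         # foo\+bar
```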
This happens due to a double string escaping. You should use 5 slashes in this case.
"foo+bar".gsub(/([+])/, '\\\\\1')
Adding \ two more times escapes this properly.
irb(main):011:0> puts "foo+bar".gsub(/(\+)/, '\\\\\1')
foo\+bar
=> nil

Resources