multi-character parentheses in syntax table - elisp

I know that I can have multi-character comment delimiters, but how about multi-character parentheses, e.g. if I wanted to treat the character sequence "%{" as an opening parenthesis and "}%" as a closing one like this
%{
}%

Emacs does not support this; see the Info entry for syntax tables. Comments (and strings) have matching delimiters, and they also have corresponding "generic" syntax classes. The parenthesis class has no similar support.
So you're stuck with single-character parentheses.

Related

Double escape characters in elisp regex patterns

(regexp-opt '("this" "that"))
returns,
"\\(?:th\\(?:at\\|is\\)\\)"
Why are there double backslashes everywhere in this elisp regex? Doesn't elisp regex use a single backslash?
Also, the ? symbol is a postfix operator in regex patterns, which means it acts on the expression that precedes it (http://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Special.html#Regexp-Special). But here there is no expression before the ? operator. So what does the
(?:th\\
part mean in this regex?
The backslash is part of the regexp syntax. But to preserve it as part of a regexp string, you need to protect it with another backslash, as the documentation on string syntax explains:
'Likewise, you can include a backslash by preceding it with another backslash, like this: "this \\ is a single embedded backslash".'
As for the ?: construct, it's how you specify a non-capturing or "shy" group:
"A shy group serves the first two purposes of an ordinary group (controlling the nesting of other operators), but it does not get a number, so you cannot refer back to its value with ‘\digit’. Shy groups are particularly useful for mechanically-constructed regular expressions, because they can be added automatically without altering the numbering of ordinary, non-shy groups."
It's documented as part of the regexp backslash documentation. As the passage quoted above explains, it's useful in functions like regexp-opt for grouping patterns without creating capture groups.
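Both points can be tried out in Ruby, which is used here only because it is easy to run; it is not elisp. Two differences to keep in mind: in Emacs it is \( that groups, whereas in Ruby a bare ( groups, and Ruby regexp literals do not need the doubled backslash (Ruby strings do). The constant and method names are made up for this sketch.

```ruby
# Illustration in Ruby, not elisp. All names here are hypothetical.

# Inside a double-quoted string literal, "\\" denotes ONE backslash character.
ONE_BACKSLASH = "\\"

# Building a regexp from a string therefore needs doubled backslashes,
# just as an elisp regexp string does. The regexp source is \(x\),
# which in Ruby matches the literal text (x).
LITERAL_PAREN = Regexp.new("\\(x\\)")

# (?:...) is a non-capturing ("shy") group: it groups but gets no number.
SHY      = /(?:th(?:at|is))/
CAPTURED = /(th(at|is))/

# Number of capture groups reported for a successful match, or nil.
def group_count(regexp, text)
  m = regexp.match(text)
  m ? m.captures.length : nil
end
```

The shy version matches the same strings as the capturing one but reports no numbered groups, which is why a generator like regexp-opt can add such groups freely.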

Minus sign that appears not to designate a range and escaping a closing parenthesis

I have 2 questions regarding the following regex from Why's Poignant Guide to Ruby:
1: What does the minus sign mean here? It doesn't seem to be designating a range because there is nothing to the left of it other than the bracket.
2: Why is it necessary to escape the closing parenthesis? After you escape the opening one, what special meaning could the closing parenthesis have?
/\([-\w]+\)/
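1) When the minus sign is at the beginning or at the end of a character class, it is treated as a literal.
2) Escaping the closing parenthesis is not optional: outside a character class, an unescaped ) is always a metacharacter that closes a group, so matching a literal ) requires \). IMO it also avoids ambiguity with a possible opening parenthesis earlier in the pattern. Consider these examples: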
/(\([-\w]+\))/ or /(\([-\w]+)\)/
1) The minus sign is a literal minus sign. Since it cannot possibly designate a range, it has no special meaning and so the character class is equivalent to [\-\w] - escaping the hyphen is optional, as you observe in your second point...
2) ...however, it isn't always good form to not escape something just because the regular expression engine allows it. For example, this regex: ([([^)-]+) is perfectly valid (I think...) but entirely unclear because of the fact that characters which normally have special meanings are used as literal characters without being escaped. Valid, yes, but not obvious, and someone who doesn't know all the rules will become very confused trying to understand it.
The minus sign -, or hyphen, means exactly the character -. The hyphen can be placed right after the opening bracket, right before the closing bracket, or right after the negating caret. In those positions it cannot designate a range, so there is no ambiguity. You can also write \- if you like.
As for why ) is escaped, I think it is to reduce the regex engine's work, so that it doesn't have to remember whether an opening parenthesis came before.
The - sign in this regex is a literal - that you want to match in the text.
Non-escaped parentheses create a match group, which is then available to you, for example through the $1 variable.
> "(-w)" =~ /\([-\w]+\)/
> $1 # => nil
and
> "(-w)" =~ /([-\w]+)/
> $1 # => -w
You can go to Rubular and try both regexes \([-\w]+\) and ([-\w]+) - and you will see different results by passing (-w) as a test. You can notice match groups appearing.
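The points made in these answers can be checked with a short Ruby sketch. The constant and method names are made up for illustration.

```ruby
# A hyphen is literal at the start or end of a character class, or when
# escaped; between two characters it forms a range.
LEADING  = /\([-\w]+\)/
TRAILING = /\([\w-]+\)/
ESCAPED  = /\([\-\w]+\)/
RANGE    = /[a-c]/        # here the hyphen means "a through c"

# Unescaped parentheses create a capture group instead of matching ( and ).
GROUPED  = /([-\w]+)/

# Returns the whole matched text, or nil when there is no match.
def matched(regexp, text)
  m = regexp.match(text)
  m && m[0]
end

# Returns the first capture group, or nil when there is none.
def first_capture(regexp, text)
  m = regexp.match(text)
  m && m[1]
end
```

With the input "(-w)", the three escaped-parenthesis patterns match the whole text including the parentheses and capture nothing, while GROUPED matches only -w and exposes it as the first group.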

Matching an unescaped balanced pair of delimiters

How can I match a balanced pair of delimiters not escaped by backslash (that is itself not escaped by a backslash) (without the need to consider nesting)? For example with backticks, I tried this, but the escaped backtick is not working as escaped.
regex = /(?!<\\)`(.*?)(?!<\\)`/
"hello `how\` are` you"
# => $1: "how\\"
# expected "how\\` are"
And the regex above does not consider a backslash that is escaped by a backslash and is in front of a backtick, but I would like to.
How does StackOverflow do this?
The purpose of this is not very complicated. I have documentation texts, which include the backtick notation for inline code just like StackOverflow, and I want to display that in an HTML file with the inline code decorated with some span material. There would be no nesting, but escaped backticks or escaped backslashes may appear anywhere.
Lookbehind is the first thing everyone thinks of for this kind of problem, but it's the wrong tool, even in flavors like .NET that support unrestricted lookbehinds. You can hack something up, but it's going to be ugly, even in .NET. Here's a better way:
`[^`\\]*(\\.[^`\\]*)*`
The first part starts from the opening delimiter and gobbles up anything that's not the delimiter or a backslash. If the next character is a backslash, it consumes that and the character following it, whatever it may be. It could be the delimiter character, another backslash, or anything else, it doesn't matter.
It repeats those steps as many times as necessary, and when neither [^`\\] nor \\. can match, the next character must be the closing delimiter. Or the end of the string, but I'm assuming the input is well formed. But if it's not well formed, this regex will fail very quickly. I mention that because of this other approach I see a lot:
`(?:[^`\\]+|\\.)*`
This works fine on well-formed input, but what happens if you remove the last backtick from your sample input?
"hello `how\` are you"
According to RegexBuddy, after encountering the first backtick, this regex performed 9,252 distinct operations (or steps) before it could give up and report failure; mine failed in ten steps.
EDIT To extract just the part inside the delimiters, wrap that part in a capturing group. You'll still have to remove the backslashes manually.
`([^`\\]*(?:\\.[^`\\]*)*)`
I also changed the other group to non-capturing, which I should have done from the start. I don't avoid capturing religiously, but if you are using them to capture stuff, any other groups you use should be non-capturing.
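Here is a minimal Ruby sketch of that capturing version, including the manual backslash removal. The helper name inline_code_spans is hypothetical, and the unescaping rule (drop one backslash, keep the character after it) is an assumption about what "remove the backslashes" should mean.

```ruby
# Capture the text between two unescaped backticks.
INLINE_CODE = /`([^`\\]*(?:\\.[^`\\]*)*)`/

# Hypothetical helper: returns every code span found in text, with each
# escaping backslash removed and the escaped character kept.
def inline_code_spans(text)
  text.scan(INLINE_CODE).map { |(body)| body.gsub(/\\(.)/, '\1') }
end
```

For the sample input from the question this yields one span containing how` are, and for the input with the last backtick removed it yields no spans at all.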
EDIT I think I've been reading too much into the question. On StackOverflow, if you want to include literal backticks in an inline-code segment or a comment, you use three backticks as the delimiter, not just one. Since there's no need to escape backticks, you can ignore backslashes as well. Your regex could turn out to be as simple as this:
```(.*?)```
Dealing with the possibility of false delimiters, you use the same basic technique:
```([^`]*(?:`(?!``)[^`]*)*)```
Is this what you're after?
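A small Ruby sketch of the three-backtick version follows. The names are made up, and the delimiter is written with a {3} quantifier, which is equivalent to typing three backticks, only to keep the listing free of literal triple backticks.

```ruby
# Three backticks, then any run of text in which a backtick is allowed as
# long as it is not the start of another triple, then three backticks.
FENCED_CODE = /`{3}([^`]*(?:`(?!``)[^`]*)*)`{3}/

# Hypothetical helper: returns the contents of every delimited span.
def fenced_spans(text)
  text.scan(FENCED_CODE).flatten
end

# Builds a sample string without typing the delimiter literally.
def fenced_sample(body)
  fence = "`" * 3
  "see #{fence}#{body}#{fence} here"
end
```

Calling fenced_spans(fenced_sample("x `y` z")) returns the body unchanged, single backticks included.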
By the way, this answer doesn't contradict @nneonneo's comment above. This answer doesn't consider the context in which the match is taking place. Is it in the source code of a program or web page? If it is, did the match occur inside a comment or a string literal? How do I even know the first backtick I found wasn't escaped? Regexes don't know anything about the context in which they operate; that's what parsers are for.
If you don't need nesting, regexes can indeed be a proper tool. Lexers of programming languages, for instance, use regexes to tokenize strings, and strings usually allow their own delimiters as an escaped content. Anything more complicated than that will probably need a full-blown parser though.
The "general formula" is to match an escaped character (\\.) or any character that's valid as content but doesn't need to be escaped ([^{list of invalid chars}]). A "naïve" solution would be joining them with or (|), but for a more efficient variant see @AlanMoore's answer.
The complete example is shown below, in two variants: the first assumes that backslashes should only be used for escaping inside the string, the second assumes that a backslash anywhere in the text escapes the next character.
`((?:\\.|[^`\\])*)`
(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`
Working examples here and here. However, as @nneonneo commented (and I endorsed), regexes are not meant to do a complete parse, so you'd better keep things simple if you want them to work out right. Do you want to find a token in the text, or do you want to delimit it already knowing where it starts? The answer to that question is important for deciding which strategy works best for your case.
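The difference between the two variants shows up when an escaped backtick appears before the real code span. The following Ruby sketch uses made-up names, and the \A anchor on the second pattern is an addition of this sketch so that matching always starts at the beginning of the text.

```ruby
# Variant 1: backslashes are only treated as escapes inside the span.
INSIDE_ONLY = /`((?:\\.|[^`\\])*)`/

# Variant 2: a backslash anywhere escapes the next character, so the text
# before the opening backtick is consumed with the same escape-aware pattern.
# The \A anchor is an assumption added for this sketch.
ANYWHERE = /\A(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`/

# Returns the contents of the first span the pattern finds, or nil.
def first_span(regexp, text)
  m = regexp.match(text)
  m && m[1]
end
```

On the text a \` b `c` d, the first variant starts at the escaped backtick and returns " b ", while the second skips over it and returns "c". On the sample from the question both return the same span.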

Matching braces in ruby with a character in front

I have read quite a few posts here for matching nested braces in Ruby using Regexp. However I cannot adapt it to my situation and I am stuck. The Ruby 1.9 book uses the following to match a set of nested braces
/\A(?<brace_expression>{([^{}]|\g<brace_expression>)*})\Z/x
I am trying to alter this in three ways. 1. I want to use parentheses instead of braces, 2. I want a character in front (such as a hash symbol), and 3. I want to match anywhere in the string, not just beginning and end. Here is what I have so far.
/(#(?<brace_expression>\(([^\(\)]|\g<brace_expression>)*\)))/x
Any help in getting the right expression would be appreciated.
Using the regex modifier x enables comments in the regex. So the # in your regex is interpreted as a comment character and the rest of the regex is ignored. You'll need to either escape the # or remove the x modifier.
Btw: There's no need to escape the parentheses inside [].
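Putting both remarks together, one possible corrected pattern looks like the sketch below. The group name paren_expr and the helper are made up, and the inner group was changed to non-capturing, which the original did not do.

```ruby
# With the x modifier an unescaped # starts a comment, so it is written \#.
# Parentheses inside the character class [^()] need no escaping.
HASH_PARENS = /\#(?<paren_expr>\((?:[^()]|\g<paren_expr>)*\))/x

# Hypothetical helper: returns the first "#(...)" with balanced parentheses
# found anywhere in text, or nil when there is none.
def first_hash_paren(text)
  text[HASH_PARENS]
end
```

For example, first_hash_paren("foo #(a (b) c) bar") picks out the whole #(a (b) c) expression, nested pair included.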

Can the reserved word 'then' always be replaced with a semicolon or linebreak?

Does using then instead of a semicolon or linebreak serve only a decorative purpose (to make code more readable)?
The keyword then can appear in two places in ruby: if statements and case statements. In both cases it can be replaced with a linebreak or a semicolon.
So yes, it's merely decorative.
In if expressions and case expressions, the condition is terminated either with the then keyword or an expression separator (i.e. semicolon or newline).
So, yes it can always be replaced with semicolon or linebreak.
And no, its purpose is not only decorative: it separates the consequence from the condition in an if or case expression.
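The equivalence is easy to see side by side. A small sketch, with method names made up for the example:

```ruby
# Three spellings of the same if expression.
def sign_with_then(n)
  if n < 0 then "negative" else "non-negative" end
end

def sign_with_semicolon(n)
  if n < 0; "negative" else "non-negative" end
end

def sign_with_newline(n)
  if n < 0
    "negative"
  else
    "non-negative"
  end
end

# The same holds for case/when.
def size_with_then(n)
  case n
  when 0 then "zero"
  else "some"
  end
end

def size_with_newline(n)
  case n
  when 0
    "zero"
  else
    "some"
  end
end
```

All three sign_ methods return the same value for any integer, and so do the two size_ methods. What cannot be dropped is the separator itself: something has to mark where the condition ends.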
