What do the values $1, $2, $', $` mean in Ruby?
They're captures from the most recent pattern match (just as in Perl; Ruby initially lifted a lot of syntax from Perl, although it's largely gotten over it by now :). $1, $2, etc. refer to parenthesized captures within a regex: given /a(.)b(.)c/, $1 will be the character between a and b and $2 the character between b and c. $` and $' mean the strings before and after the string that matched the entire regex (which is itself in $&), respectively.
There is actually some sense to these, if only historically; you can find it in perldoc perlvar, which generally does a good job of documenting the intended mnemonics and history of Perl variables, and mostly still applies to the globals in Ruby. The numbered captures are replacements for the capture backreference regex syntax (\1, \2, etc.); Perl switched from the latter to the former somewhere in the 3.x versions, because using the backreference syntax outside of the regex complicated parsing too much. (By the time Perl 5 rolled around, the parser had been sufficiently rewritten that the syntax was again available, and promptly reused for references/"pointers". Ruby opted for using a name-quote : instead, which is closer to the Lisp and Smalltalk style; since Ruby started out as a Perl-alike with Smalltalk-style OO, this made more sense linguistically.) The same applies to $&, which in historical regex syntax is simply & (but you can't use that outside the replacement part of a substitution, so it became a variable $& instead). $` and $' are both "cutesy": "back-quote" and "forward-quote" from the matched string.
The non-numbered ones are listed here:
https://www.zenspider.com/ruby/quickref.html#pre-defined-variables
$1, $2 ... $N refer to matches in a regex capturing group.
So:
"ab:cd" =~ /([a-z]+):([a-z]+)/
Would yield
$1 = "ab"
$2 = "cd"
While looking online on how to get a file's extension and name, I found:
filename=$(basename "$fullfile")
extension="${filename##*.}"
filename="${filename%.*}"
What is the ${} syntax...? I know regular expressions but "${filename##*.}" and "${filename%.*}" escape my understanding.
Also, what's the difference between:
filename=$(basename "$fullfile")
And
filename=`basename "$fullfile"`
...?
Looking in Google is a nightmare, because of the strange characters...
The ${filename##*.} expression is parameter expansion ("parameters" being the technical name for the shell feature that other languages call "variables"). Plain ${varname} is the value of the parameter named varname, and if that's all you're doing, you can leave off the curly braces and just put $varname. But if you leave the curly braces there, you can put other things inside them after the name, to modify the result. The # and % are some of the most basic modifiers - they remove a prefix or suffix of the string that matches a wildcard pattern. # removes from the beginning, and % from the end; in each case, a single instance of the symbol removes the shortest matching string, while a double symbol matches the longest. So ${filename##*.} is "the value of filename with everything from the beginning to the last period removed", while ${filename%.*} is "the value of filename with everything from the last period to the end removed".
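To see the four prefix/suffix modifiers side by side, here is a minimal sketch. The path in fullfile is a made-up sample, picked because it contains two dots, which is where the short and long forms differ:

```shell
# Hypothetical sample value; a name with two dots shows short vs. long matching.
fullfile="/tmp/archive.tar.gz"

filename="${fullfile##*/}"     # longest prefix matching */ removed  -> archive.tar.gz
ext_short="${filename#*.}"     # shortest prefix matching *. removed -> tar.gz
ext_long="${filename##*.}"     # longest prefix matching *. removed  -> gz
base_short="${filename%.*}"    # shortest suffix matching .* removed -> archive.tar
base_long="${filename%%.*}"    # longest suffix matching .* removed  -> archive

printf '%s\n' "$filename" "$ext_short" "$ext_long" "$base_short" "$base_long"
```

Note that the patterns are shell wildcards, not regular expressions: the dot is literal and * matches any run of characters.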
The backticks syntax (`...`) is the original way of doing command substitution in the Bourne shell, and has since been borrowed by languages like Perl and Ruby to incorporate calling out to system commands. But it doesn't deal well with nesting, and its attempt to even allow nesting means that quoting works differently inside them, and it's all very confusing. The newer $(...) syntax, originally introduced in the Korn shell and then adopted by Bash and zsh and codified by POSIX, lets quoting work the same at all levels of a nested substitution and makes for a nice symmetry with the ${...} parameter expansion.
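A rough illustration of the nesting difference, using a made-up path without spaces so the backtick version stays readable (a path with spaces would need yet another layer of escaped quotes there):

```shell
# Hypothetical path, chosen without spaces on purpose.
fullfile="/tmp/project/report.txt"

# $(...) nests as-is, and the inner double quotes behave like ordinary quotes.
parent=$(basename "$(dirname "$fullfile")")

# With backticks, the inner pair has to be backslash-escaped to nest at all.
parent_bt=`basename \`dirname $fullfile\``

printf '%s\n' "$parent" "$parent_bt"
```

Both variables should end up holding the name of the containing directory; the only difference is how much escaping it took to get there.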
As @e0k states in a comment on the question, the ${varname...} syntax is Bash's parameter (variable) expansion. It has its own syntax that is unrelated to regular expressions; it encompasses a broad set of features that include:
specifying a default value
prefix and postfix stripping
string replacement
substring extraction
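One example per feature in the list above, as a quick sketch. The variable names and values are invented for illustration, and the replacement and substring forms assume bash (they are not part of plain POSIX sh):

```shell
# Hypothetical values, for illustration only.
unset name
greeting="hello world"

default="${name:-anonymous}"        # default value: used because name is unset
no_prefix="${greeting#hello }"      # prefix stripping
no_suffix="${greeting% world}"      # postfix (suffix) stripping
replaced="${greeting/world/bash}"   # string replacement (first match only)
sub="${greeting:0:5}"               # substring: offset 0, length 5

printf '%s\n' "$default" "$no_prefix" "$no_suffix" "$replaced" "$sub"
```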
The difference between `...` and $(...) (both of which are forms of so-called command substitutions) is:
`...` is the older syntax (often called deprecated, but that's not strictly true).
$(...) is its modern equivalent, which facilitates nested use and works more intuitively when it comes to quoting.
See here for more information.
have
(use extras format posix posix-extras regex regex-literals utils srfi-1)
have regex with logical groupings 1 and 2
/^(\\W+)\\s+(\\W+)/
but am having trouble with the syntax to actually -use- 1 and 2 .
Should I be using $1 $2 , or \1 and \2 , or something else? I'll be using
1 and 2 on the same LOC as the regex itself.
Thanks in advance,
Still-learning Steve
This question is old and it isn't particularly clear: you don't explain how you've tried to use the regex. I'll attempt to answer it anyway.
First off, there are no "special" variables $1 or $2 like in Perl or Ruby. With that out of the way, it becomes a simple matter of how to use the various procedures.
For example, with string-match, you simply receive a list of matches:
#;1> (use regex regex-literals)
#;2> (string-match #/b(a)(r)/ "bar")
("bar" "a" "r")
So, to refer to the Nth submatch you'd use (list-ref the-match N) (where 0 equals the complete matched string).
With string-substitute and when using back references within the regex, you'd use "\\1" (you have to use two backslashes to escape the backslash in string context):
#;1> (use regex regex-literals)
#;2> (string-substitute #/f(o)(\1)/ "lala\\2xyz" "foo")
"lalaoxyz"
This works in general, but there's an undocumented feature (or perhaps a bug) that if you use a backslash in front of an escape sequence in the replacement, it will be escaped. See this bug report on how that works, and how to use irregex instead of the regex egg to avoid this.
Can someone explain what $3 and $2 are in this syntax when using coderay?
http://railscasts.com/episodes/207-syntax-highlighting?view=comments
require 'coderay'
def coderay(text)
  text.gsub(/\<code( lang="(.+?)")?\>(.+?)\<\/code\>/m) do
    content_tag("notextile", CodeRay.scan($3, $2).div(:css => :class))
  end
end
I've also seen $4. Where are these defined, and what do they reference, and is there documentation for it?
I don't even know what the proper question is to ask about these. Basically... what are they? I must understand.
They are set by the regex match that gsub performs, and are called "captures". They hold whatever the parenthesized groups in the regular expression matched. In your example, $1 will be what matches ( lang="(.+?)") (the whole optional attribute, leading space included), $2 will be the match for .+? inside the lang attribute, and $3 the match for the other .+?, the tag contents. More precisely, $1 is a special global variable that is identical to Regexp.last_match[1], which is, in turn, the same as Regexp.last_match.captures[0]. Similarly for the others.
You can find the Regexp-related special global variables reference in Regexp documentation.
It has nothing to do with CodeRay/RedCloth, and everything to do with regular expressions and core Ruby.
For the following variable:
var="/path/to/my/document-001_extra.txt"
i need only the parts between the / [slash] and the _ [underscore].
Also, the - [dash] needs to be stripped.
In other words: document 001
This is what I have so far:
var="${var##*/}"
var="${var%_*}"
var="${var/-/ }"
which works fine, but I'm looking for a more compact substitution pattern that would spare me the triple var=...
Use of sed, awk, cut, etc. would perhaps make more sense for this, but I'm looking for a pure bash solution.
Needs to work under GNU bash, version 3.2.51(1)-release
After editing your question to talk about patterns instead of regular expressions, I'll now show you how to actually use regular expressions in bash :)
[[ $var =~ ^.*/(.*)-(.*)_ ]] && var="${BASH_REMATCH[@]:1:2}"
Parameter expansions like you were using previously unfortunately cannot be nested in bash (unless you use ill-advised eval hacks, and even then it will be less clear than the line above).
The =~ operator performs a match between the string on the left and the regular expression on the right. Parentheses in the regular expression define match groups. If a match is successful, the exit status of [[ ... ]] is zero, and so the code following the && is executed. (Reminder: don't confuse the "0=success, non-zero=failure" convention of process exit statuses with the common Boolean convention of "0=false, 1=true".)
BASH_REMATCH is an array parameter that bash sets following a successful regular-expression match. The first element of the array contains the full text matched by the regular expression; each of the following elements contains the contents of the corresponding capture group.
The ${foo[@]:x:y} parameter expansion produces y elements of the array, starting with index x. In this case, it's just a short way of writing ${BASH_REMATCH[1]} ${BASH_REMATCH[2]}. (Also, while var=${BASH_REMATCH[*]:1:2} would have worked as well, I tend to use @ anyway to reinforce the fact that you almost always want to use @ instead of * in other contexts.)
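Spelled out step by step, the one-liner does roughly the following. This is only a sketch: the helper variable names (re, whole, first, second, joined) are mine, and the regex is kept in a variable so that it reaches =~ unquoted:

```shell
var="/path/to/my/document-001_extra.txt"
re='^.*/(.*)-(.*)_'    # same pattern as above, stored in a (hypothetical) helper variable

if [[ $var =~ $re ]]; then
    whole="${BASH_REMATCH[0]}"        # text matched by the entire regex
    first="${BASH_REMATCH[1]}"        # first capture group
    second="${BASH_REMATCH[2]}"       # second capture group
    joined="${BASH_REMATCH[@]:1:2}"   # two elements starting at index 1, joined by a space
fi

printf '%s\n' "$whole" "$first" "$second" "$joined"
```

In an assignment, the quoted array slice collapses into a single string with the elements separated by spaces, which is why the last line yields the desired result directly.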
Both of the following should work correctly, though the second is sensitive to misplaced characters (if you have a / or - after the last _, it will fail).
var=$(IFS=_ read s _ <<<"$var"; IFS=-; echo ${s##*/})
var=$(IFS=/-_; a=($var); echo "${a[@]:${#a[@]} - 3:2}")
I have a regex that gives me one result in sed but another in Perl (and Ruby).
I have the string one;two;;three and I want to highlight the substrings delimited by the ;. So I do the following in Perl:
$a = "one;two;;three";
$a =~ s/([^;]*)/[\1]/g;
print $a;
(Or, in Ruby: print "one;two;;three".gsub(/([^;]*)/, "[\\1]").)
The result is:
[one][];[two][];[];[three][]
(I know the reason for the spurious empty substrings.)
Curiously, when I run the same regexp in sed I get a different result. I run:
echo "one;two;;three" | sed -e 's/[^;]*/[\0]/g'
and I get:
[one];[two];[];[three]
What is the reason for this different result?
EDIT:
Somebody replied "because sed is not perl". I know that. The reason I'm asking my question is because I don't understand how sed copes so well with zero-length matches.
This is an interesting and surprising edge case.
Your [^;]* pattern may match the empty string, so it becomes a philosophy question, viz., how many empty strings are between two characters: zero, one, or many?
sed
The sed match clearly follows the philosophy described in the “Advancing After a Zero–Length Regex Match” section of “Zero–Length Regex Matches.”
Now the regex engine is in a tricky situation. We’re asking it to go through the entire string to find all non–overlapping regex matches. The first match ended at the start of the string, where the first match attempt began. The regex engine needs a way to avoid getting stuck in an infinite loop that forever finds the same zero-length match at the start of the string.
The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match, if the previous match was zero–length.
That is, zero empty strings are between characters.
The above passage is not an authoritative standard, and quoting such a document instead would make this a better answer.
Inspecting the source of GNU sed, we see
/* Start after the match. last_end is the real end of the matched
substring, excluding characters that were skipped in case the RE
matched the empty string. */
start = offset + matched;
last_end = regs.end[0];
Perl and Ruby
Perl’s philosophy with s///, which Ruby seems to share—so the documentation and examples below use Perl to represent both—is there is exactly one empty string after each character.
The “Regexp Quote–Like Operators” section of the perlop documentation reads
The /g modifier specifies global pattern matching—that is, matching as many times as possible within the string.
Tracing execution of s/([^;]*)/[\1]/g gives
Start. The “match position,” denoted by ^, is at the beginning of the target string.
o n e ; t w o ; ; t h r e e
^
Attempt to match [^;]*.
o n e ; t w o ; ; t h r e e
     ^
Note that the result captured in $1 is one.
Attempt to match [^;]*.
o n e ; t w o ; ; t h r e e
     ^
Important Lesson: The * regex quantifier always succeeds because it means “zero or more.” In this case, the substring in $1 is the empty string.
The rest of the match proceeds as in the above.
Being a perceptive reader, you now ask yourself, “Self, if * always succeeds, how does the match terminate at the end of the target string, or for that matter, how does it get past even the first zero–length match?”
We find the answer to this incisive question in the “Repeated Patterns Matching a Zero–length Substring” section of the perlre documentation.
However, long experience has shown that many programming tasks may be significantly simplified by using repeated subexpressions that may match zero–length substrings. Here’s a simple example being:
@chars = split //, $string; # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower–level loops given by the greedy quantifiers *+{}, and for higher-level ones like the /g modifier or split operator.
…
The higher–level loops preserve an additional state between iterations: whether the last match was zero–length. To break the loop, the following match after a zero–length match is prohibited to have a length of zero. This prohibition interacts with backtracking … and so the second best match is chosen if the best match is of zero length.
Other Perl approaches
With the addition of a negative lookbehind assertion, you can filter the spurious empty matches.
$ perl -le '$a = "one;two;;three";
$a =~ s/(?<![^;])([^;]*)/[\1]/g;
print $a;'
[one];[two];[];[three]
Apply what Mark Dominus dubbed Randal’s Rule, “Use capturing when you know what you want to keep. Use split when you know what you want to throw away.” You want to throw away the semicolons, so your code becomes more direct with
$ perl -le '$a = "one;two;;three";
$a = join ";", map "[$_]", split /;/, $a;
print $a;'
[one];[two];[];[three]
From the source code for sed-4.2 for the substitute function:
/sed/execute.c
/* If we're counting up to the Nth match, are we there yet?
And even if we are there, there is another case we have to
skip: are we matching an empty string immediately following
another match?
This latter case avoids that baaaac, when passed through
s,a*,x,g, gives `xbxxcx' instead of xbxcx. This behavior is
unacceptable because it is not consistently applied (for
example, `baaaa' gives `xbx', not `xbxx'). */
This indicates that the behavior we see in Ruby and Perl was consciously avoided in sed. This is not due to any fundamental difference between the languages but a result of special handling in sed.
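The case from that source comment is easy to try yourself. A small sketch, assuming GNU sed is the sed on your PATH (other implementations may handle empty matches differently):

```shell
# Assumes GNU sed; the first input string is the one from the quoted source comment.
result=$(printf '%s\n' "baaaac" | sed 's,a*,x,g')

# Same idea applied to the string from the question.
highlighted=$(printf '%s\n' "one;two;;three" | sed 's/[^;]*/[&]/g')

printf '%s\n' "$result" "$highlighted"
```

According to the comment, the first line should come out as xbxcx: the empty match right after the run of a's is skipped, so no extra x appears before the c.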
There's something else going on in the perl (and presumably ruby) scripts as that output makes no sense for simply handling the regexp as a BRE or ERE.
awk (EREs) and sed (BREs) behave as they should for just doing an RE replacement:
$ echo "one;two;;three" | sed -e 's/[^;]*/[&]/g'
[one];[two];[];[three]
$ echo "one;two;;three" | awk 'gsub(/[^;]*/,"[&]")'
[one];[two];[];[three]
You said "I know the reason for the spurious empty substrings." Care to clue us in?