What are $3 $2 in coderay/redcloth/textile/markdown? - ruby

Can someone explain what $3 and $2 are in this syntax when using coderay?
require 'coderay'
def coderay(text)
  text.gsub(/\<code( lang="(.+?)")?\>(.+?)\<\/code\>/m) do
    content_tag("notextile", CodeRay.scan($3, $2).div(:css => :class))
  end
end
I've also seen $4. Where are these defined, and what do they reference, and is there documentation for it?
I don't even know what the proper question is to ask about these. Basically... what are they? I must understand.

They are set by gsub (as by any other regex match) and are called "captures". They hold whatever was matched by the parenthesized groups in the regular expression. In your example, $1 will be what matches the outer group ( lang="(.+?)"), $2 will be the match for the .+? inside the lang attribute, and $3 the match for the other .+?, the tag contents. More precisely, $1 is a special global variable that is identical to Regexp.last_match[1], which is, in turn, the same as Regexp.last_match.captures[0]. Similarly for the others.
You can find the reference for the Regexp-related special global variables in the Regexp documentation.
It has nothing to do with CodeRay/RedCloth, and everything to do with regular expressions and core Ruby.
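To see the captures without CodeRay involved, here is a minimal sketch. The helper names (code_tag_pattern, code_captures, code_captures_via_last_match) and the sample strings are made up for illustration; the pattern is the one from the question, only without the unneeded backslashes before < and >.

```ruby
# Same pattern as in the question, minus the unneeded backslashes.
def code_tag_pattern
  /<code( lang="(.+?)")?>(.+?)<\/code>/m
end

# Hypothetical helper: returns [$1, $2, $3] for the first <code> tag found.
def code_captures(text)
  return nil unless code_tag_pattern =~ text
  [$1, $2, $3]
end

# Hypothetical helper: the same values read through Regexp.last_match.
def code_captures_via_last_match(text)
  return nil unless code_tag_pattern =~ text
  Regexp.last_match.captures
end

p code_captures('<code lang="ruby">puts 1</code>')
# => [" lang=\"ruby\"", "ruby", "puts 1"]

# Without a lang attribute the optional groups did not take part in the match,
# so $1 and $2 are nil.
p code_captures('<code>x = 1</code>')
# => [nil, nil, "x = 1"]
```

Since the pattern has only three groups, $4 would be nil here.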


Extract a substring (value of an HTML node tag) in a bash/zsh script

I'm trying to extract a tag value of an HTML node that I already have in a variable.
I'm currently using Zsh but I'm trying to make it work in Bash as well.
The current variable has the value:
<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>
and I would like to get the value of data-count (in this case 0, but could be any length integer).
I have tried using cut, sed and variable expansion as explained in this question, but I haven't managed to adapt the regexes, or maybe it has to be done differently for Zsh.
There is no reason why sed would not work in this situation. For your specific case, I would do something like this:
sed 's/.*data-count="\([0-9]*\)".*/\1/g' file_name.txt
Basically, it states that sed is looking for a pattern that contains data-count=, then saves everything matched within the parentheses \(...\) into \1, which is subsequently printed in place of the match (the full line, due to the .* on both sides).
Could you please try the following.
awk 'match($0,/data-count=[^ ]*/){print substr($0,RSTART+12,RLENGTH-13)}' Input_file
Explanation: awk's match function is used with the regex data-count=[^ ]*, which matches everything from data-count up to the next space. If a match is found, the built-in variables RSTART and RLENGTH are set. The substring of the current line is then printed based on these values, skipping data-count=" at the start and the closing quote at the end, so that only the value of data-count remains.
With sed, could you please try the following.
sed 's/.*data-count=\"\([^"]*\).*/\1/' Input_file
Explanation: Using sed's group referencing, the value that follows data-count=" (everything up to the next double quote) is saved in the first group. Since s (substitution) replaces the whole matched line with \1 (the saved group), only that value is printed.
As was said before, to be on the safe side and handle any syntactically valid HTML tag, a parser would be strongly advised. But if you know in advance what the general format of your HTML element will look like, the following hack might come in handy:
Assume that your variable is called "html":
html='<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>'
First adapt it a bit:
htmlx="tag ${html%??}"
This will add the string tag in front and remove the final />
Now make an associative array:
declare -A fields
fields=( ${=$(tr = ' ' <<<$htmlx)} )
The tr turns the equal sign into a space and the ${=...} forces word splitting (this is zsh syntax, so this part will not work in bash as written). You can now access the values of your attributes by, say,
echo $fields[data-count]
Note that this still has the surrounding double quotes. You can easily remove them with
echo ${${fields[data-count]%?}#?}
Of course, once you do this hack, you have access to all attributes in the same way.

Regex not working as field separator on awk

I have this text file foo.txt which contains words mixed with punctuation marks.
What I want to do is filter out every punctuation mark using awk, so I used a regex as the field separator, like this: awk -F '[^a-zA-Z]+' '{ print $0 }' foo.txt. The problem I'm facing is that the text stays just like the original; nothing is filtered.
Does anyone know why this happens?
¿Hello? How... are foo you?'
Bye ,, hehe '" .lol
Result Expected
Hello How are foo you
Bye hehe lol
I know I can achieve the same result using sed with something like sed 's/[[:punct:]]//g' foo.txt or sed s/[^A-Za-z]/" "/g foo.txt, but I want to know why the awk command is not working. I've already looked everywhere and I can't find an answer; I'm not going to be able to sleep.
If you want to know where to find the rules behind this, I would point you to the POSIX standard for awk.
However, the answer is spread over two places:
The awk utility shall interpret each input record as a sequence of fields where, by default, a field is a string of non- <blank> non- <newline> characters. This default <blank> and <newline> field delimiter can be changed by using the FS built-in variable or the -F sepstring option. The awk utility shall denote the first field in a record $1, the second $2, and so on. The symbol $0 shall refer to the entire record; setting any other field causes the re-evaluation of $0. Assigning to $0 shall reset the values of all other fields and the NF built-in variable.
Variables and Special Variables
References to nonexistent fields (that is, fields after $NF), shall evaluate to the uninitialized value. Such references shall not create new fields. However, assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS. Each field variable shall have a string value or an uninitialized value when created. Field variables shall have the uninitialized value when created from $0 using FS and the variable does not contain any characters.
It is a bit awkward to find the rule for recomputing $0 when new fields are introduced, but this is essentially the rule.
Furthermore, the statement print $0 prints the entire record. So according to the above, you first need to recompute $0, as shown in the answer of @oguzismail.
So changing the field separator can be done in the following way:
awk 'BEGIN{FS="oldFS"; OFS="newFS"}{$1=$1}1' <file>
Remark: you do not need to check whether the line contains any fields, as in NF{$1=$1}, since {$1=$1} on an empty line will just introduce an empty field without an extra OFS.

What does this variable assignment do?

I'm having to code a subversion hook script, and I found a few examples online, mostly python and perl. I found one or two shell scripts (bash) as well. I am confused by a line and am sorry this is so basic a question.
The script later uses this to perform a test, such as (assume EXT=ex):
if [[ "$FILTER" == *"$EXT"* ]]; then blah
My problem is the above test is true. However, I'm not asking you to assist in writing the script, just explaining the initial assignment of FILTER. I don't understand that line.
Editing in a closer example FILTER line. Of course the script, as written, does not work, because 'ex' returns true, and not just 'exe'. My problem here is only, however, that I don't understand the layout of the variable assignment itself.
Why is there a period at the beginning? ".(sh..."
Why is there a dollar sign at the end? "...BAT)$"
Why are there pipes between each pattern? "sh|SH|exe"
You are probably looking for something like the following:
FILTER='\.(sh|SH|exe|EXE|bat|BAT)$'
for EXT in "$@"; do
  if [[ "$EXT" =~ $FILTER ]]; then
    echo "$EXT extension disallowed"
  else
    echo "$EXT is allowed"
  fi
done
Save it to myscript.sh and run it as
myscript.sh bash ba.sh
and you will get
bash is allowed
ba.sh extension disallowed
If you don't escape the "dot", e.g. with FILTER=".(sh|SH|exe|EXE|bat|BAT)$", you will get
bash extension disallowed
ba.sh extension disallowed
Which is (of course) wrong.
For the questions:
Why is there a period at the beginning? ".(sh..."
Because you want to match .sh (as an extension) and not, for example, bash (without the dot). Therefore the . must be escaped, as \., because an unescaped . in a regex means "any character".
Why is there a dollar sign at the end? "...BAT)$"
The $ means "end of string". You want to match file.sh and not file.sh.jpg. The .sh should be at the end of the string.
Why are there pipes between each pattern? "sh|SH|exe"
In a regex, the (...|...|...) construction delimits the "alternatives", as you surely guessed.
You really need to read a regex tutorial; the topic is more complicated than this and can't be explained in one answer.
Ps: NEVER use UPPERCASE variable names, they can collide with environment variables.
This just assigns a string to FILTER; the contents of that string have no special meaning. When you try to match it against the pattern *ex*, the result is true as long as the value of $FILTER contains the string ex, with anything on either side. That is the case here; ex is a substring of exe.
+---- here is the "ex" from the pattern.
As far as I can tell, this is similar to a regular expression pattern:
In regular expressions the start of a string is marked with ^. The leading . here is not such an anchor, though: unescaped it matches any single character, and it was presumably meant as the literal dot in front of the extension.
Inside the parentheses you have the exact strings for the file extensions to be matched; they are combined as alternatives ("or") with |.
And the $ at the end means the match must stop at the end of the string, with nothing after the extension.
I would say that is how the original author looked at it and implemented it.

Ruby string containing ${...}

In the Ruby string:
"${0} ${1} ${2:hello}"
is ${i} the ith argument of the command that called this particular file?
I tried searching the web for "Ruby ${0}", but the search engines don't like non-alphanumeric characters.
I consulted a Ruby book which says #{...} will substitute the result of the code in the braces, but it does not mention ${...}. Is this a special syntax to substitute argument values into a string? Thanks very much.
As mentioned above, ${0} will do nothing special inside a string. $0 gives the name of the script, and $1 gives the first capture group of the most recent regular expression match.
To interpolate a command line argument you'd normally do this:
puts "first argument = #{ARGV[0]}"
However, ARGV is also aliased as $* so you could also write
puts "first argument = #{$*[0]}"
Perhaps that's where the confusion arose?
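A quick way to convince yourself is the sketch below. The helper names and the sample argument list are made up for illustration; an explicit array stands in for ARGV so the snippet runs without command-line arguments.

```ruby
# Double-quoted strings only interpolate #{...}; ${...} is kept literally.
def literal_dollar_brace
  "${0} ${1} ${2:hello}"
end

# Hypothetical helper: interpolates the first element of an argument list,
# the way "#{ARGV[0]}" would for the real command-line arguments.
def first_argument_line(args)
  "first argument = #{args[0]}"
end

puts literal_dollar_brace
# => ${0} ${1} ${2:hello}

puts first_argument_line(%w[foo bar])
# => first argument = foo

# $0 is only interpolated when written with the # form.
puts "script name = #{$0}"
```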

What are Ruby's numbered global variables

What do the values $1, $2, $', $` mean in Ruby?
They're captures from the most recent pattern match (just as in Perl; Ruby initially lifted a lot of syntax from Perl, although it's largely gotten over it by now :). $1, $2, etc. refer to parenthesized captures within a regex: given /a(.)b(.)c/, $1 will be the character between a and b and $2 the character between b and c. $` and $' mean the strings before and after the string that matched the entire regex (which is itself in $&), respectively.
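As a rough illustration of the above, using the same /a(.)b(.)c/ regex (the helper name match_parts and the sample string are made up for this sketch):

```ruby
# Hypothetical helper: runs the match and collects the special variables
# that Ruby sets for the most recent successful match.
def match_parts(str)
  return nil unless str =~ /a(.)b(.)c/
  {
    pre: $`,     # text before the match
    match: $&,   # the entire matched text
    post: $',    # text after the match
    first: $1,   # first parenthesized capture
    second: $2   # second parenthesized capture
  }
end

p match_parts("xxa1b2cyy")
# => {:pre=>"xx", :match=>"a1b2c", :post=>"yy", :first=>"1", :second=>"2"}
# (newer Ruby versions print the hash as {pre: "xx", ...})
```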
There is actually some sense to these, if only historically; you can find it in perldoc perlvar, which generally does a good job of documenting the intended mnemonics and history of Perl variables, and mostly still applies to the globals in Ruby. The numbered captures are replacements for the capture backreference regex syntax (\1, \2, etc.); Perl switched from the backreference syntax to the numbered variables somewhere in the 3.x versions, because using the backreference syntax outside of the regex complicated parsing too much. (By the time Perl 5 rolled around, the parser had been sufficiently rewritten that the syntax was again available, and promptly reused for references/"pointers". Ruby opted for using a name-quote : instead, which is closer to the Lisp and Smalltalk style; since Ruby started out as a Perl-alike with Smalltalk-style OO, this made more sense linguistically.) The same applies to $&, which in historical regex syntax is simply & (but you can't use that outside the replacement part of a substitution, so it became a variable $& instead). $` and $' are both "cutesy": "back-quote" and "forward-quote" from the matched string.
The non-numbered ones are listed here:
$1, $2 ... $N refer to matches in a regex capturing group.
"ab:cd" =~ /([a-z]+):([a-z]+)/
Would yield
$1 = "ab"
$2 = "cd"
