scheme chicken regex logical grouping - chicken-scheme

have
(use extras format posix posix-extras regex regex-literals utils srfi-1)
have regex with logical groupings 1 and 2
/^(\\W+)\\s+(\\W+)/
but am having trouble with the syntax to actually -use- 1 and 2 .
Should I be using $1 $2 , or \1 and \2 , or something else? I'll be using
1 and 2 on the same LOC as the regex itself.
Thanks in advance,
Still-learning Steve

This question is old and it isn't particularly clear: you don't explain how you've tried to use the regex. I'll attempt to answer it anyway.
First off, there are no "special" variables $1 or $2 like in Perl or Ruby. With that out of the way, it becomes a simple matter of how to use the various procedures.
For example, with string-match, you simply receive a list of matches:
#;1> (use regex regex-literals)
#;2> (string-match #/b(a)(r)/ "bar")
("bar" "a" "r")
So, to refer to the Nth submatch you'd use (list-ref the-match N) (where 0 equals the complete matched string).
With string-substitute and when using back references within the regex, you'd use "\\1" (you have to use two backslashes to escape the backslash in string context):
#;1> (use regex regex-literal)
#;2> (string-substitute #/f(o)(\1)/ "lala\\2xyz" "foo")
"lalaoxyz"
This works in general, but there's an undocumented feature (or perhaps a bug) that if you use a backslash in front of an escape sequence in the replacement, it will be escaped. See this bugreport on how that works, and how to use irregex instead of the regex egg to aovid this.

Related

Regex not matching string in scheme but works on other platform

I am running string-match using the pattern [ \[\]a-zA-Z0-9_:.,/-]+ to match a sample text Text [a,b]. Although the pattern works on regex101, when I run it on scheme it returns #f. Here is the regex101 link.
This is the function I am running
(string-match "[ \\[\\]a-zA-Z0-9_:.,/-]+" "Text [a,b]")
Why isn't it working on scheme but works eleswhere? Am I missing something?
After discussing the issue on the guile gnu mailing list, I found out that Guile's (ice-9 regex) library uses POSIX extended regular expressions. And this flavor of regular expression doesn't support escaping in character classes [..], hence that's why it wasn't matching the strings.
However, I used the following function as a workaround and it works:
(string-match "[][a-zA-Z]+" "Text[ab]")
I don't see anything wrong with your regular expression syntax as it is quoted correctly so I assume there must be a bug in Guile, or the regexp library it uses, where \] just isn't interpreted the correct way inside brackets. I found a workaround by using the octal code point values instead:
(string-match "[A-Za-z\\[\\0135]+" "Text [a,b]")
; ==> #("Text [a,b]" (0 . 4))
Your regular expression isn't very good. It matches any combination of those chars so "]/Te,3.xt[2" also matches. If you are expecting a string like "Something [something, something]" I would rather have made /[A-Z][a-z0-9]+ [[a-z0-9]+,[a-z0-9]+]/ instead. eg.
(define pattern "[A-Z][a-z0-9]+ \\[[a-z0-9]+,[a-z0-9]+\\]")
(string-match pattern "Test [q,w]") ; ==> #("Test [q,w]" (0 . 10))
(string-match pattern "Be100 [sub,45]") ; ==> #("Be100 [sub,45]" (0 . 14))

grep wildcards issue ubuntu

I have an input file named test which looks like this
leonid sergeevich vinogradov
ilya alexandrovich svintsov
and when I use grep like this grep 'leonid*vinogradov' test it says nothing, but when I type grep 'leonid.*vinogradov' test it gives me the first string. What's the difference between * and .*? Because I see no difference between any number of any characters and any character followed by any number of any characters.
I use ubuntu 14.04.3.
* doesn't match any number of characters, like in a file glob. It is an operator, which indicates 0 or more matches of the previous character. The regular expression leonid*vinogradov would require a v to appear immediately after 0 or more ds. The . is the regular expression metacharcter representing any single character, so .* matches 0 or more arbitrary characters.
grep uses regex and .* matches 0 or more of any characters.
Where as 'leonid*vinogradov' is also evaluated as regex and it means leoni followed by 0 or more of letter d hence your match fails.
It's Regular Expression grep uses, short as regexp, not wildcards you thought. In this case, "." means any character, "" means any number of (include zero) the previous character, so "." means anything here.
Check the link, or google it, it's a powerful tool you'll find worth to knew.

Regex to match "AAAA:AAA" pattern

A string must begin with 3 or 4 letters (not numbers), and a ":" symbol should follow these letters, and after the colon there should be three more characters, like AAA. For example, AAAA:AAA or AAA:AAA.
I`m starting to build this, but regex is so much pain for me, can anyone help me with this?
Here is what I have now:
^[a-zA-Z]{3,4}(:)$
Your regex is almost there: you need to add [a-zA-Z]{3}.
I prefer the [[:alpha:]] POSIX class in Ruby to match letters though.
/[[:alpha:]]/ - Alphabetic character
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters.
So, here is a possible regex:
\A[[:alpha:]]{3,4}:[[:alpha:]]{3}\z
See demo
The regex matches:
\A - start of string (in RoR, you have to use \A instead of ^, or you will get errors)
[[:alpha:]]{3,4} - 3 or 4 letters
: - literal :
[[:alpha:]]{3} - 3 letters
\z - end of string (in RoR, you have to use \z instead of $, or you will get errors)
To allow just AAA or AAAA, you need to introduce an optional (? quantifier) non-capturing group ((?:...) construction):
\A[[:alpha:]]{3,4}(?::[[:alpha:]]{3})?\z
^^^ ^^
See another demo
Try using this (quotes if regex in your dialect must be passed as a string)
"^[a-zA-Z]{3,4}:[a-zA-Z]{3}$"

Why does this regex run differently in sed than in Perl/Ruby?

I have a regex that gives me one result in sed but another in Perl (and Ruby).
I have the string one;two;;three and I want to highlight the substrings delimited by the ;. So I do the following in Perl:
$a = "one;two;;three";
$a =~ s/([^;]*)/[\1]/g;
print $a;
(Or, in Ruby: print "one;two;;three".gsub(/([^;]*)/, "[\\1]").)
The result is:
[one][];[two][];[];[three][]
(I know the reason for the spurious empty substrings.)
Curiously, when I run the same regexp in sed I get a different result. I run:
echo "one;two;;three" | sed -e 's/[^;]*/[\0]/g'
and I get:
[one];[two];[];[three]
What is the reason for this different result?
EDIT:
Somebody replied "because sed is not perl". I know that. The reason I'm asking my question is because I don't understand how sed copes so well with zero-length matches.
This is an interesting and surprising edge case.
Your [^;]* pattern may match the empty string, so it becomes a philosophy question, viz., how many empty strings are between two characters: zero, one, or many?
sed
The sed match clearly follows the philosophy described in the “Advancing After a Zero–Length Regex Match” section of “Zero–Length Regex Matches.”
Now the regex engine is in a tricky situation. We’re asking it to go through the entire string to find all non–overlapping regex matches. The first match ended at the start of the string, where the first match attempt began. The regex engine needs a way to avoid getting stuck in an infinite loop that forever finds the same zero-length match at the start of the string.
The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match, if the previous match was zero–length.
That is, zero empty strings are between characters.
The above passage is not an authoritative standard, and quoting such a document instead would make this a better answer.
Inspecting the source of GNU sed, we see
/* Start after the match. last_end is the real end of the matched
substring, excluding characters that were skipped in case the RE
matched the empty string. */
start = offset + matched;
last_end = regs.end[0];
Perl and Ruby
Perl’s philosophy with s///, which Ruby seems to share—so the documentation and examples below use Perl to represent both—is there is exactly one empty string after each character.
The “Regexp Quote–Like Operators” section of the perlop documentation reads
The /g modifier specifies global pattern matching—that is, matching as many times as possible within the string.
Tracing execution of s/([^;]*)/[\1]/g gives
Start. The “match position,” denoted by ^, is at the beginning of the target string.
o n e ; t w o ; ; t h r e e
^
Attempt to match [^;]*.
o n e ; t w o ; ; t h r e e
^
Note that the result captured in $1 is one.
Attempt to match [^;]*.
o n e ; t w o ; ; t h r e e
^
Important Lesson: The * regex quantifier always succeeds because it means “zero or more.” In this case, the substring in $1 is the empty string.
The rest of the match proceeds as in the above.
Being a perceptive reader, you now ask yourself, “Self, if * always succeeds, how does the match terminate at the end of the target string, or for that matter, how does it get past even the first zero–length match?”
We find the answer to this incisive question in the “Repeated Patterns Matching a Zero–length Substring” section of the perlre documentation.
However, long experience has shown that many programming tasks may be significantly simplified by using repeated subexpressions that may match zero–length substrings. Here’s a simple example being:
#chars = split //, $string; # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower–level loops given by the greedy quantifiers *+{}, and for higher-level ones like the /g modifier or split operator.
…
The higher–level loops preserve an additional state between iterations: whether the last match was zero–length. To break the loop, the following match after a zero–length match is prohibited to have a length of zero. This prohibition interacts with backtracking … and so the second best match is chosen if the best match is of zero length.
Other Perl approaches
With the addition of a negative lookbehind assertion, you can filter the spurious empty matches.
$ perl -le '$a = "one;two;;three";
$a =~ s/(?<![^;])([^;]*)/[\1]/g;
print $a;'
[one];[two];[];[three]
Apply what Mark Dominus dubbed Randal’s Rule, “Use capturing when you know what you want to keep. Use split when you know what you want to throw away.” You want to throw away the semicolons, so your code becomes more direct with
$ perl -le '$a = "one;two;;three";
$a = join ";", map "[$_]", split /;/, $a;
print $a;'
[one];[two];[];[three]
From the source code for sed-4.2 for the substitute function:
/sed/execute.c
/* If we're counting up to the Nth match, are we there yet?
And even if we are there, there is another case we have to
skip: are we matching an empty string immediately following
another match?
This latter case avoids that baaaac, when passed through
s,a*,x,g, gives `xbxxcx' instead of xbxcx. This behavior is
unacceptable because it is not consistently applied (for
example, `baaaa' gives `xbx', not `xbxx'). */
This indicates that the behavior we see in Ruby and Perl was consciously avoided in sed. This is not due to any fundamental difference between the languages but a result of special handling in sed
There's something else going on in the perl (and presumably ruby) scripts as that output makes no sense for simply handling the regexp as a BRE or ERE.
awk (EREs) and sed (BREs) behave as they should for just doing an RE replacement:
$ echo "one;two;;three" | sed -e 's/[^;]*/[&]/g'
[one];[two];[];[three]
$ echo "one;two;;three" | awk 'gsub(/[^;]*/,"[&]")'
[one];[two];[];[three]
You said I know the reason for the spurious empty substrings.. Care to clue us in?

What are Ruby's numbered global variables

What do the values $1, $2, $', $` mean in Ruby?
They're captures from the most recent pattern match (just as in Perl; Ruby initially lifted a lot of syntax from Perl, although it's largely gotten over it by now :). $1, $2, etc. refer to parenthesized captures within a regex: given /a(.)b(.)c/, $1 will be the character between a and b and $2 the character between b and c. $` and $' mean the strings before and after the string that matched the entire regex (which is itself in $&), respectively.
There is actually some sense to these, if only historically; you can find it in perldoc perlvar, which generally does a good job of documenting the intended mnemonics and history of Perl variables, and mostly still applies to the globals in Ruby. The numbered captures are replacements for the capture backreference regex syntax (\1, \2, etc.); Perl switched from the former to the latter somewhere in the 3.x versions, because using the backreference syntax outside of the regex complicated parsing too much. (By the time Perl 5 rolled around, the parser had been sufficiently rewritten that the syntax was again available, and promptly reused for references/"pointers". Ruby opted for using a name-quote : instead, which is closer to the Lisp and Smalltalk style; since Ruby started out as a Perl-alike with Smalltalk-style OO, this made more sense linguistically.) The same applies to $&, which in historical regex syntax is simply & (but you can't use that outside the replacement part of a substitution, so it became a variable $& instead). $` and $' are both "cutesy": "back-quote" and "forward-quote" from the matched string.
The non-numbered ones are listed here:
https://www.zenspider.com/ruby/quickref.html#pre-defined-variables
$1, $2 ... $N refer to matches in a regex capturing group.
So:
"ab:cd" =~ /([a-z]+):([a-z]+)/
Would yield
$1 = "ab"
$2 = "cd"

Resources