Ruby, weird substitution - ruby

For example:
str1 = "pppp(m)pppp"
str2 = "(m)"
str1 = str1.sub(/#{str2}/, "<>#{str2}<>")
I get this:
"pppp(<>(m)<>)pppp"
I expected to get this:
"pppp<>(m)<>pppp"
Why is this happening, and how can I avoid it?

( and ) have a special meaning in regexen and do not actually match the characters ( and ). The regex /(m)/ will match any m whether or not it is enclosed in parentheses (and if it is, it won't match the parentheses).
To match literal parentheses use \( and \) - or in a case like this where you're interpolating a string, you can just use Regexp.escape on the string, i.e. /#{ Regexp.escape(str2) }/.
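A minimal sketch of the difference, using the strings from the question. The variable names unescaped, escaped and plain are only for illustration:

```ruby
str1 = "pppp(m)pppp"
str2 = "(m)"

# Interpolated as-is: the parentheses form a capture group, so only "m" matches.
unescaped = str1.sub(/#{str2}/, "<>#{str2}<>")

# Regexp.escape turns "(m)" into "\(m\)", so the literal parentheses match too.
escaped = str1.sub(/#{Regexp.escape(str2)}/, "<>#{str2}<>")

# A String pattern is matched literally, so no regex is needed at all here.
plain = str1.sub(str2, "<>#{str2}<>")

puts unescaped  # pppp(<>(m)<>)pppp
puts escaped    # pppp<>(m)<>pppp
puts plain      # pppp<>(m)<>pppp
```

If the search text never needs regex features, passing the String directly is probably the simplest option.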

The regular expression treats "(m)" as a capture group because parentheses are operators in regular expressions. To match a literal "(m)" you need to escape them with the escape character \, as in /\(m\)/.

Related

How does the REGEXP_REPLACE below work?

I have a query in my project that uses REGEXP_REPLACE.
I tried to find out how it works by searching, and found that
\w+ matches one or more word characters (that is, alphanumeric or underscore
(_) characters),
but I could not find an explanation for '"\w+\":'. Why are the double quotes used, and what is meant by '{|}|"',''?
UPDATE (SELECT data,data_value FROM TEMP) t
SET t.DATA_VALUE=REGEXP_REPLACE(REGEXP_REPLACE(t.data, '"\w+\":',''),'{|}|"','');
can you please tell me how it works?
This appears to be a regular expression for stripping keys and enclosing brackets from a JSON string. Unfortunately, if this is the case, then it does not work in all situations.
The regular expression
'"\w+\":'
will match:
A " double quotation mark;
\w+ one-or-more word (a-z or A-Z or 0-9 or _) characters;
\" another double quotation mark - note: the \ character is not necessary; then
A : colon.
So:
REGEXP_REPLACE(
'{"key":"value","key2":"value with \"quote"}',
'"\w+":', -- Pattern matched
'' -- Replacement string
)
Will output:
{"value","value with \"quote"}
The second pattern {|}|" will match either a {, or a } or a " character (and could have been equivalently written as [{}"]) so:
REGEXP_REPLACE(
'{"value","value with \"quote"}',
'{|}|"', -- Pattern matched
'' -- Replacement string
)
Will output:
value,value with \quote
Which is fine, until (like my example) you have an escaped double quote (or curly braces) in the value string; in which case those will also get stripped leaving the escape character.
(Note: you would not typically find this but it is possible to include escaped quotes in the key. So {"keywith\":quote":"value"} would get replaced to {quote":"value"} and then quote:value which is not the intended output.)
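The same two-step replacement can be reproduced outside Oracle to see the problem. This is only a sketch in Ruby, assuming Ruby's \w behaves like Oracle's for these ASCII inputs; the variable names are hypothetical. For comparison it also parses the string with Ruby's standard json library, which handles the escaped quote correctly:

```ruby
require "json"

# Single quotes keep the backslash, so this is the same text as the SQL literal.
json = '{"key":"value","key2":"value with \"quote"}'

# Step 1: remove every "key": prefix.
without_keys = json.gsub(/"\w+":/, "")

# Step 2: remove braces and double quotes ([{}"] is equivalent to {|}|").
stripped = without_keys.gsub(/[{}"]/, "")

puts without_keys  # {"value","value with \"quote"}
puts stripped      # value,value with \quote

# A real JSON parser keeps the quote and drops only the escape character.
values = JSON.parse(json).values
p values           # ["value", "value with \"quote"]
```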
If parsing JSON is what you are trying to do (pre-Oracle 12) then you can use:
REGEXP_REPLACE(
'{"key":"value","key2":"value with \"quote","keywith\":quote":"value with \"{}"}',
'^{|"(\\"|[^"])+":(")?((\\"|[^"])+?)\2((,)|})',
'\3\6'
)
Which outputs:
value,value with \"quote,value with \"{}
Or in Oracle 12 you can do:
SELECT *
FROM JSON_TABLE(
'{"key":"value","key2":"value with \"quote","keywith\":quote":"value with \"{}"}',
'$.*' NULL ON ERROR
COLUMNS (
value VARCHAR2(4000) PATH '$'
)
)
Which outputs:
VALUE
-----------------
value
value with "quote
value with "{}
The general syntax is: REGEXP_REPLACE( string, pattern [, replacement_string [, start_position [, nth_appearance [, match_parameter ] ] ] ] )
| means "or" (it can separate more than two alternatives); a comma inside braces, as in {n,}, means "at least n times".
Source: https://www.techonthenet.com/oracle/functions/regexp_replace.php
As for '"\w+\":' and '{|}|"','': \w matches a word character and + means one or more times. The " and \" on either side of \w+ match the literal double quotes around the key.
The nested call takes the result of the inner replacement and uses it as the input for the outer one. Regular expressions aren't too bad; they are fairly intuitive once you get the basics down.

How exactly does this work string.split(/\?|\.|!/).size?

I know, or at least I think I know, what this does (string.split(/\?|\.|!/).size); splits the string at every ending punctuation into an array and then gets the size of the array.
The part I am confused with is (/\?|\.|!/).
Thank you for your explanation.
Regular expressions are surrounded by slashes / /
The backslash before the question mark and dot means use those characters literally (don't interpret them as special instructions)
The vertical pipes are "or"
So you have / then question mark \? then "or" | then period \. then "or" | then exclamation point ! then / to end the expression.
/\?|\.|!/
It's a Regular Expression. That particular one matches any '?', '.' or '!' in the target string.
You can learn more about them here: http://regexr.com/
A regular expression splitting on the char "a" would look like this: /a/. A regular expression splitting on "a" or "b" is like this: /a|b/. So splitting on "?", "!" and "." would look like /?|!|./ - but it does not. Unfortunately, "?", and "." have special meaning in regexps which we do not want in this case, so they must be escaped, using "\".
A way to avoid this is to use Regexp.union("?","!",".") which results in /\?|!|\./
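A short sketch of Regexp.union in use. The sample text and variable names are made up for illustration; a character class is shown as an alternative that needs no escaping:

```ruby
# Regexp.union escapes each String argument and joins them with "|".
enders = Regexp.union("?", "!", ".")
puts enders.source  # \?|!|\.

text = "Is it done? Yes! Ship it."
parts = text.split(enders)
p parts             # ["Is it done", " Yes", " Ship it"]

# Inside a character class, ? and . lose their special meaning.
same_parts = text.split(/[?.!]/)
p same_parts        # ["Is it done", " Yes", " Ship it"]
```

Note that split drops the trailing empty string that would otherwise follow the final period.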
(/\?|\.|!/)
Working outside in:
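The outer parentheses () are just the method-call parentheses around the argument to split; they are not part of the regular expression and capture nothing.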
The // tell Ruby you're using a Regular Expression.
\? Matches any ?
\. Matches any .
! Matches any !
The preceding \ tells Ruby we want to find these specific characters in the string, rather than using them as special characters.
Special characters (that need to be escaped to be matched) are:
. | ( ) [ ] { } + \ ^ $ * ?.
There is a nice guide to Ruby RegEx at:
http://rubular.com/ & http://www.tutorialspoint.com/ruby/ruby_regular_expressions.htm
For SO answers that involve regular expressions, I often use the "extended" mode, which makes them self-documenting. This one would be:
r = /
\? # match a question mark
| # or
\. # match a period
| # or
! # match an exclamation mark
/x # extended mode
str = "Out, damn'd spot! out, I say!—One; two: why, then 'tis time to " +
"do't.—Hell is murky.—Fie, my lord, fie, a soldier, and afeard?"
str.split(r)
#=> ["Out, damn'd spot",
# " out, I say",
# "—One; two: why, then 'tis time to do't",
# "—Hell is murky",
# "—Fie, my lord, fie, a soldier, and afeard"]
str.split(r).size #=> 5
@steenslag mentioned Regexp::union. You could also use Regexp::new to write (with single quotes):
r = Regexp.new('\?|\.|!')
#=> /\?|\.|!/
but it really doesn't buy you anything here. You might find it useful in other situations, however.

Why won't my simple regex pattern match and remove a file extension?

I have a string:
app_copy--28.ipa
The result I want is:
app_copy
The number after -- could be of variable length, so I want to match everything including and after --.
I've tried a few patterns, but none are matching for some reason:
gsub("--\*", "")
gsub("--*", "")
gsub("--*.ipa", "")
gsub("--\[0-9].ipa", "")
What am I missing?
Let's take a look at your test patterns:
"--\*" is actually equivalent to "--*" (since the \* is an escape sequence).
"--*" will match a single - character, followed by zero or more - characters.
"--*.ipa" will match a single - character, followed by zero or more - characters, followed by any single character, followed by a literal ipa.
"--\[0-9].ipa" is actually equivalent to "--[0-9].ipa" (since the \[ is an escape sequence), which will match a literal --, followed by a single decimal digit, followed by any single character, followed by a literal ipa.
However, none of these patterns would work as you used them because gsub will not treat it as a regular expression:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally…
You'd need to convert your pattern to a Regexp (using Regexp.new), or use a regular expression literal.
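A small sketch of the difference between a String pattern and a Regexp pattern, using the file name from the question. The variable names and the stricter final pattern are illustrative assumptions, not part of the original answer:

```ruby
name = "app_copy--28.ipa"

# A String pattern is matched literally; there is no "--*" text, so nothing changes.
literal = name.gsub("--*", "")

# A regex literal: "--" followed by zero or more of any character.
from_literal = name.gsub(/--.*/, "")

# The same pattern built from a String with Regexp.new.
from_string = name.gsub(Regexp.new("--.*"), "")

# A stricter variant: "--", digits, then ".ipa" at the very end of the string.
strict = name.sub(/--\d+\.ipa\z/, "")

puts literal       # app_copy--28.ipa
puts from_literal  # app_copy
puts from_string   # app_copy
puts strict        # app_copy
```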
Try this pattern
--.*
This pattern will find any literal --, followed by zero or more of any character.
For example:
"app_copy--28.ipa".gsub(/--.*/, "") # app_copy
Don't use gsub to try to change the string; simply use a pattern to match the part you want:
"app_copy--28.ipa"[/^(.+?)--/, 1] # => "app_copy"
String's [] takes a lot of different types of parameters. You can pass in a pattern, and the index of the capture that you want, to extract just that part. From the documentation:
str[regexp, capture] → new_str or nil
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, follows the regular expression that component of the MatchData is returned instead.
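As a rough illustration of the capture argument, the sketch below uses both a group index and a named group (the group name base is an arbitrary choice), and shows that nil comes back when the pattern does not match:

```ruby
name = "app_copy--28.ipa"

# Capture by index: group 1 is everything before the first "--".
by_index = name[/^(.+?)--/, 1]

# Capture by name: the same group, referred to as "base".
by_name = name[/^(?<base>.+?)--/, "base"]

# No "--" in the string, so the lookup returns nil rather than raising.
no_match = "plain.ipa"[/^(.+?)--/, 1]

p by_index  # "app_copy"
p by_name   # "app_copy"
p no_match  # nil
```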
How about this?
str = "app_copy--28.ipa"
str[0..str.index("-")-1]
# => "app_copy"
str = "app_copy--28.ipa"
str.split("--").first
# => "app_copy"

How to remove the first 4 characters from a string if it matches a pattern in Ruby

I have the following string:
"h3. My Title Goes Here"
I basically want to remove the first four characters from the string so that I just get back:
"My Title Goes Here".
The thing is I am iterating over an array of strings and not all have the h3. part in front so I can't just ditch the first four characters blindly.
I checked the docs and the closest thing I could find was chomp, but that only works for the end of a string.
Right now I am doing this:
"h3. My Title Goes Here".reverse.chomp(" .3h").reverse
This gives me my desired output, but there has to be a better way. I don't want to reverse a string twice for no reason. Is there another method that will work?
To alter the original string, use sub!, e.g.:
my_strings = [ "h3. My Title Goes Here", "No h3. at the start of this line" ]
my_strings.each { |s| s.sub!(/^h3\. /, '') }
To not alter the original and only return the result, remove the exclamation point, i.e. use sub. In the general case you may have a regular expression that can match more than once, and you want every match replaced; in that case use gsub! or gsub. Without the g, only the first match is replaced (which is what you want here; in any case, ^ matches only at the start of a line, so in a single-line string it can match at most once).
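A brief sketch contrasting the non-mutating and mutating forms. It uses \A (start of string) instead of ^ as a slightly stricter anchor, and also shows String#delete_prefix, which assumes Ruby 2.5 or later; the variable names are illustrative:

```ruby
titles = ["h3. My Title Goes Here", "No h3. at the start of this line"]

# sub returns a new string and leaves the original untouched.
cleaned = titles.map { |s| s.sub(/\Ah3\. /, "") }

# sub! modifies the receiver and returns nil when nothing was replaced.
changed = "h3. Title".dup.sub!(/\Ah3\. /, "")
unchanged = "Title".dup.sub!(/\Ah3\. /, "")

# For a fixed prefix, no regex is needed (Ruby 2.5+).
prefix_free = titles.map { |s| s.delete_prefix("h3. ") }

p cleaned      # ["My Title Goes Here", "No h3. at the start of this line"]
p changed      # "Title"
p unchanged    # nil
p prefix_free  # ["My Title Goes Here", "No h3. at the start of this line"]
```

Because sub! can return nil, it is safer not to chain further calls onto its result.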
You can use sub with a regular expression:
s = 'h3. foo'
s.sub!(/^h[0-9]+\. /, '')
puts s
Output:
foo
The regular expression should be understood as follows:
^ Match from the start of the string.
h A literal "h".
[0-9] A digit from 0-9.
+ One or more of the previous (i.e. one or more digits)
\. A literal period.
A space (yes, spaces are significant by default in regular expressions!)
You can modify the regular expression to suit your needs. See a regular expression tutorial or syntax guide, for example here.
A standard approach would be to use regular expressions:
"h3. My Title Goes Here".gsub /^h3\. /, '' #=> "My Title Goes Here"
gsub means globally substitute and it replaces a pattern by a string, in this case an empty string.
The regular expression is enclosed in / and consists of:
^ means beginning of the string
h3 is matched literally, so it means h3
\. - a dot normally means any character so we escape it with a backslash
a space, which is matched literally

Ruby gsub / regex modifiers?

Where can I find the documentation on the modifiers for gsub? \a \b \c \1 \2 \3 %a %b %c $1 $2 %3 etc.?
Specifically, I'm looking at this code... something.gsub(/%u/, unit) what's the %u?
First off, %u is nothing special in ruby regex:
mixonic@pandora ~ $ irb
irb(main):001:0> '%u'.gsub(/%u/,'heyhey')
=> "heyhey"
The definitive documentation for Ruby 1.8 regex is in the Ruby Doc Bundle:
http://ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/syntax.html#regexp
Strings delimited by slashes are regular expressions. The characters right after the latter slash denote the options to the regular expression. Option i means that the regular expression is case insensitive. Option o means that the regular expression does expression substitution only once, the first time it is evaluated. Option x means extended regular expression, which means whitespace and comments are allowed in the expression. Option p denotes POSIX mode, in which newlines are treated as normal characters (matched by dots).
The %r/STRING/ is another form of the regular expression.
^            beginning of a line or string
$            end of a line or string
.            any character except newline
\w           word character [0-9A-Za-z_]
\W           non-word character
\s           whitespace character [ \t\n\r\f]
\S           non-whitespace character
\d           digit, same as [0-9]
\D           non-digit
\A           beginning of a string
\Z           end of a string, or before newline at the end
\z           end of a string
\b           word boundary (outside [] only)
\B           non-word boundary
\b           backspace (0x08) (inside [] only)
[ ]          any single character of set
*            0 or more previous regular expression
*?           0 or more previous regular expression (non greedy)
+            1 or more previous regular expression
+?           1 or more previous regular expression (non greedy)
{m,n}        at least m but at most n previous regular expression
{m,n}?       at least m but at most n previous regular expression (non greedy)
?            0 or 1 previous regular expression
|            alternation
( )          grouping regular expressions
(?# )        comment
(?: )        grouping without backreferences
(?= )        zero-width positive look-ahead assertion
(?! )        zero-width negative look-ahead assertion
(?ix-ix)     turns on (or off) `i' and `x' options within regular expression. These modifiers are localized inside an enclosing group (if any).
(?ix-ix: )   turns on (or off) `i' and `x' options within this non-capturing group.
Backslash notation and expression substitution are available in regular expressions.
Good luck!
Zenspider's Quickref contains a section explaining which escape sequences can be used in regexen and one listing the pseudo variables that get set by a regexp match. In the second argument to gsub you simply write the name of the variable with a backslash instead of a $ and it will be replaced with the value of that variable after applying the regexp. If you use a double quoted string, you need to use two backslashes.
When using the block-form of gsub you can simply use the variables directly. If you return a string containing e.g. \1 from the block, that will not be replaced with $1. That only happens when using the two-argument form.
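The rules above can be sketched as follows. The sample name and the pattern are made up for illustration:

```ruby
name = "John Smith"
pattern = /(\w+) (\w+)/

# Two-argument form, single-quoted replacement: one backslash per reference.
single = name.gsub(pattern, '\2, \1')

# Two-argument form, double-quoted replacement: the backslash must be doubled.
double = name.gsub(pattern, "\\2, \\1")

# Block form: use the $1, $2 variables directly.
block = name.gsub(pattern) { "#{$2}, #{$1}" }

# Block form returning '\2, \1': the backslash sequences are NOT expanded.
literal = name.gsub(pattern) { '\2, \1' }

puts single   # Smith, John
puts double   # Smith, John
puts block    # Smith, John
puts literal  # \2, \1
```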
If you use a block with sub/gsub, you can access the groups like this:
>> rx = /(ab(cd)ef)/
>> s = "-abcdef-abcdef"
>> s.gsub(rx) { $2 }
=> "-cd-cd"
For Ruby 1.9's Oniguruma there is good documentation of the regular expression syntax here.
gsub is also a string substitution function in the Lua language.
In Lua's pattern language, %u represents the uppercase character class, i.e. it matches any uppercase letter. Similarly, %l matches any lowercase letter.
Lua Regex Class Patterns
