Ruby gsub / regex modifiers? - ruby

Where can I find the documentation on the modifiers for gsub? \a \b \c \1 \2 \3 %a %b %c $1 $2 %3 etc.?
Specifically, I'm looking at this code: something.gsub(/%u/, unit). What's the %u?

First off, %u is nothing special in a Ruby regex; it simply matches the literal characters % and u:
mixonic#pandora ~ $ irb
irb(main):001:0> '%u'.gsub(/%u/,'heyhey')
=> "heyhey"
The definitive documentation for Ruby 1.8 regex is in the Ruby Doc Bundle:
http://ruby-doc.org/docs/ruby-doc-bundle/Manual/man-1.4/syntax.html#regexp
Strings delimited by slashes are regular expressions. The characters right after the latter slash denote the options to the regular expression. Option i means that the regular expression is case insensitive. Option o means that the regular expression does expression substitution only once, the first time it is evaluated. Option x means extended regular expression, which means whitespace and comments are allowed in the expression. Option p denotes POSIX mode, in which newlines are treated as normal characters (matched by dots).
The %r/STRING/ form is another way to write a regular expression.
^ - beginning of a line or string
$ - end of a line or string
. - any character except newline
\w - word character [0-9A-Za-z_]
\W - non-word character
\s - whitespace character [ \t\n\r\f]
\S - non-whitespace character
\d - digit, same as [0-9]
\D - non-digit
\A - beginning of a string
\Z - end of a string, or before newline at the end
\z - end of a string
\b - word boundary (outside [] only)
\B - non-word boundary
\b - backspace (0x08) (inside [] only)
[ ] - any single character of set
* - 0 or more previous regular expression
*? - 0 or more previous regular expression (non greedy)
+ - 1 or more previous regular expression
+? - 1 or more previous regular expression (non greedy)
{m,n} - at least m but at most n previous regular expression
{m,n}? - at least m but at most n previous regular expression (non greedy)
? - 0 or 1 previous regular expression
| - alternation
( ) - grouping regular expressions
(?# ) - comment
(?: ) - grouping without backreferences
(?= ) - zero-width positive look-ahead assertion
(?! ) - zero-width negative look-ahead assertion
(?ix-ix) - turns on (or off) `i' and `x' options within regular expression. These modifiers are localized inside an enclosing group (if any).
(?ix-ix: ) - turns on (or off) `i' and `x' options within this non-capturing group.
Backslash notation and expression substitution are available in regular expressions.
Good luck!
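To make a few of the entries above concrete, here is a minimal sketch of the i and x options, an inline option group, and greedy versus non-greedy repetition. The sample strings and variable names are made up for illustration, and match? assumes Ruby 2.4 or later.

```ruby
# Option i: case-insensitive matching.
case_insensitive = /ruby/i
ci_index = "I like RUBY" =~ case_insensitive   # index where the match starts

# Option x: whitespace and comments inside the pattern are ignored.
phone = /
  (\d{3})   # first group: three digits
  -
  (\d{4})   # second group: four digits
/x
phone_groups = "call 555-1234".match(phone).captures

# Inline option group: only the "a" is matched case-insensitively.
inline = /(?i:a)b/
inline_results = ["Ab", "AB"].map { |s| s.match?(inline) }

# Greedy (+) versus non-greedy (+?) repetition.
greedy = "<a><b>"[/<.+>/]
lazy   = "<a><b>"[/<.+?>/]

puts ci_index, phone_groups.inspect, inline_results.inspect, greedy, lazy
```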

Zenspider's Quickref contains a section explaining which escape sequences can be used in regexen and one listing the pseudo variables that get set by a regexp match. In the second argument to gsub you simply write the name of the variable with a backslash instead of a $ and it will be replaced with the value of that variable after applying the regexp. If you use a double quoted string, you need to use two backslashes.
When using the block-form of gsub you can simply use the variables directly. If you return a string containing e.g. \1 from the block, that will not be replaced with $1. That only happens when using the two-argument form.
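A short sketch of the difference between the two-argument form and the block form described above. The input string and the pattern are invented for the example.

```ruby
name = "John Smith"
pattern = /(\w+) (\w+)/

# Two-argument form: \1 and \2 refer to the capture groups.
single_quoted = name.gsub(pattern, '\2, \1')

# In a double-quoted string each backslash has to be doubled.
double_quoted = name.gsub(pattern, "\\2, \\1")

# Block form: use the $1/$2 variables directly.
block_form = name.gsub(pattern) { "#{$2}, #{$1}" }

# A backreference returned from the block is NOT expanded;
# the result is the two literal characters \ and 1.
block_literal = name.gsub(pattern) { '\1' }

puts single_quoted, double_quoted, block_form, block_literal
```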

If you use a block with sub/gsub, you can access the groups like this:
>> rx = /(ab(cd)ef)/
>> s = "-abcdef-abcdef"
>> s.gsub(rx) { $2 }
=> "-cd-cd"

For Ruby 1.9's Oniguruma engine, there is good documentation of the regular expression syntax here.

gsub is also a string substitution function in the Lua language.
In Lua's pattern language, %u represents the upper-case character class, i.e. it matches any upper-case letter. Similarly, %l matches any lower-case letter.
Lua Regex Class Patterns

Related

Regex to select all the commas from string that do not have any white space around them

I want to select all the commas in a string that do not have any white space around. Suppose I have this string:
"He,she, They"
I want to select only the comma between he and she. I tried this in rubular and came up with this regex:
(,[^(,\s)(\s,)])
This selects the comma that I want, but also selects an s which is a character after it.
In your regex (,[^(,\s)(\s,)]) you capture a comma followed by a negated character class, which matches any single character not listed in it. The class could also be written as (,[^)(,\s]), and the group will capture, for example, ,s.
What you could do is use a positive lookahead and a positive lookbehind to check that what is on the left and on the right is a \S non-whitespace character:
(?<=\S),(?=\S)
Regex demo
In Ruby, you may use [[:space:]] to match any (Unicode) whitespace and [^[:space:]] to match any char other than whitespace. Using these character classes inside lookarounds solves the problem:
/(?<=[^[:space:]]),(?=[^[:space:]])/
See the Rubular demo
Here,
(?<=[^[:space:]]) - a positive lookbehind that matches a location that is immediately preceded with a non-whitespace char (if the string start position should also be matched, replace with (?<![[:space:]]))
, - a comma
(?=[^[:space:]]) - a positive lookahead that matches a location that is immediately followed with a non-whitespace char (if the string end position should also be matched, replace with (?![[:space:]])).
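As a rough illustration of the lookaround pattern above, applied to the string from the question. The variable names and the second test string are assumptions added for the example.

```ruby
str = "He,she, They"

# Comma with a non-whitespace character on both sides.
tight_comma = /(?<=[^[:space:]]),(?=[^[:space:]])/
first_index = str.index(tight_comma)
match_count = str.scan(tight_comma).length
replaced    = str.gsub(tight_comma, ";")

# Negative-lookaround variant: also accepts a comma at the very
# start or end of the string.
edge_comma = /(?<![[:space:]]),(?![[:space:]])/
edge_input = ",a,b, c,"
tight_on_edges = edge_input.scan(tight_comma).length
edge_on_edges  = edge_input.scan(edge_comma).length

puts first_index, match_count, replaced, tight_on_edges, edge_on_edges
```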
Check the regex below and try the code; I hope it helps!
re = /[^\s](,)[^\s]/m
str = 'check ,my,domain, qwe,sd'
# Print the match result
str.scan(re) do |match|
  puts match.to_s
end
Check LIVE DEMO HERE

How do I match a regex in which the next non-space character is not a "/"?

How do I express in regex the letter "s" whose next non-space character is not a "/"?
These should match: "s", "str"
These should not: "s/m", "s /n"
I tried this
"str" =~ /s[^[[:space:]]]^\// #=> nil
but it does not even match the simple use case.
It seems you need to match any s that is not followed with any 0+ whitespace chars and a / after them.
Use
/s(?![[:space:]]*\/)/
See the Rubular demo.
Details
s - the letter s
(?![[:space:]]*\/) - a negative lookahead that fails the match if, immediately to the right of the current location, there are
[[:space:]]* - 0+ whitespaces
\/ - a /.
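The negative lookahead can be checked against the four sample strings from the question; this is only a quick sketch, and match? assumes Ruby 2.4 or later.

```ruby
s_not_before_slash = /s(?![[:space:]]*\/)/

should_match     = ["s", "str"]
should_not_match = ["s/m", "s /n"]

matched  = should_match.map { |t| t.match?(s_not_before_slash) }
rejected = should_not_match.map { |t| t.match?(s_not_before_slash) }

puts matched.inspect, rejected.inspect
```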
If you merely want to know the number of 's' characters that are not followed by zero or more spaces and then a forward slash (as opposed to their indices in the string), you don't have to use a regular expression.
"sea shells /by the sea s/hore".delete(" ").gsub("s/", "").count("s")
#=> 3
If you only want to know if there is at least one such 's' you could replace count("s") with include?("s").
I'm not arguing that this is preferable to the use of a regular expression.
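For comparison, a small sketch that counts the same thing both ways on the sample sentence. The variable names are made up; note that delete(" ") removes only plain spaces, so the two counts could differ on input containing tabs or other whitespace.

```ruby
text = "sea shells /by the sea s/hore"

# Count with the negative-lookahead regex.
regex_count = text.scan(/s(?![[:space:]]*\/)/).length

# Count without a regex: drop spaces, drop every "s/", count the rest.
plain_count = text.delete(" ").gsub("s/", "").count("s")

puts regex_count, plain_count
```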

Username Regular Expression

I need the username to be two or more characters of a-z, 0-9, all downcase. This is the current regex I am using
USER_REGEX = /\A[a-z0-9][-a-z0-9]{1,19}\z/i
With this regex, users are able to use uppercase characters in their username. How do I modify the current regex to avoid that?
The regular expression to filter for two to twenty lower-case characters or digits is
/^[a-z0-9]{2,20}$/
which means:
^ at the front of input
a-z accept lower-case 'a' through 'z'
0-9 accept '0' through '9'
{2,20} accept 2 to 20 elements from preceding [] block
$ until the end of input
You can make a regular expression case-insensitive with a trailing i, as in your example; that appears to be the root of the problem. That said, I don't know Ruby's peculiarities with respect to regular expressions.
If you must keep the regex, remove the "i" from the end, changing
USER_REGEX = /\A[a-z0-9][-a-z0-9]{1,19}\z/i
to
USER_REGEX = /\A[a-z0-9][-a-z0-9]{1,19}\z/
The "i" makes the regex case-insensitive, but you want it to be case-sensitive and match only lowercase letters.
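A brief sketch of how the case-sensitive version behaves, plus one Ruby-specific point: ^ and $ match at line boundaries in Ruby, so \A and \z are the safer anchors for validating a whole string. The sample usernames and local variable names are invented for illustration.

```ruby
# The regex from the question with the trailing i removed.
user_regex = /\A[a-z0-9][-a-z0-9]{1,19}\z/

accepted = %w[bob42 a-b x1].map { |u| u.match?(user_regex) }
rejected = ["Bob42", "a", "-ab", "a" * 21].map { |u| u.match?(user_regex) }

# ^ and $ anchor to lines, \A and \z anchor to the whole string.
line_anchored   = /^[a-z0-9]{2,20}$/
string_anchored = /\A[a-z0-9]{2,20}\z/
multiline_input = "ok\nNOT OK!"

line_result   = multiline_input.match?(line_anchored)    # first line alone matches
string_result = multiline_input.match?(string_anchored)  # whole string must match

puts accepted.inspect, rejected.inspect, line_result, string_result
```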

Why won't my simple regex pattern match and remove a file extension?

I have a string:
app_copy--28.ipa
The result I want is:
app_copy
The number after -- could be of variable length, so I want to match everything including and after --.
I've tried a few patterns, but none are matching for some reason:
gsub("--\*", "")
gsub("--*", "")
gsub("--*.ipa", "")
gsub("--\[0-9].ipa", "")
What am I missing?
Let's take a look at your test patterns:
"--\*" is actually equivalent to "--*" (since the \* is an escape sequence).
"--*" will match a single - character, followed by zero or more - characters.
"--*.ipa" will match a single - character, followed by zero or more - characters, followed by any single character, followed by a literal ipa.
"--\[0-9].ipa" is actually equivalent to "--[0-9].ipa" (since the \[ is an escape sequence), which will match a literal --, followed by a single decimal digit, followed by any single character, followed by a literal ipa.
However, none of these patterns would work as you used them, because gsub does not treat a String pattern as a regular expression:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally…
You'd need to convert your pattern to a Regexp (using Regexp.new), or use a regular expression literal.
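The difference between a String pattern and a Regexp pattern can be seen in a small sketch using the file name from the question. The variable names are illustrative only.

```ruby
name = "app_copy--28.ipa"

# String pattern: metacharacters are literal, and the text "--*"
# does not occur in the name, so nothing is replaced.
string_pattern = name.gsub("--*", "")

# The same idea as a Regexp, built from a string or written as a literal.
from_string = name.gsub(Regexp.new("--.*"), "")
literal     = name.gsub(/--.*/, "")

puts string_pattern, from_string, literal
```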
Try this pattern
--.*
This pattern will find any literal --, followed by zero or more of any character.
For example:
"app_copy--28.ipa".gsub(/--.*/, "") # app_copy
Don't use gsub to try to change the string, simply use a pattern to match the part you want:
"app_copy--28.ipa"[/^(.+?)--/, 1] # => "app_copy"
String's [] takes a lot of different types of parameters. You can pass in a pattern, and the index of the capture that you want, to extract just that part. From the documentation:
str[regexp, capture] → new_str or nil
If a Regexp is supplied, the matching portion of the string is returned. If a capture, which may be a capture group index or name, follows the regular expression, that component of the MatchData is returned instead.
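A minimal sketch of the str[regexp, capture] forms described in that excerpt. The named groups base and build are hypothetical names chosen for this example.

```ruby
name = "app_copy--28.ipa"

# Regexp only: the whole matching portion is returned.
whole_match = name[/^.+?--/]

# Regexp plus a capture index: only that group is returned.
by_index = name[/^(.+?)--/, 1]

# Regexp plus a capture name (hypothetical group names).
by_name = name[/^(?<base>.+?)--(?<build>\d+)/, "build"]

# No match: the result is nil.
no_match = name[/^(.+?)==/, 1]

puts whole_match, by_index, by_name, no_match.inspect
```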
How about this?
str = "app_copy--28.ipa"
str[0..str.index("-")-1]
# => "app_copy"
str = "app_copy--28.ipa"
str.split("--").first
# => "app_copy"

how to remove leading and trailing non-alphabetic characters in ruby

I want to remove any leading and trailing non-alphabetic character in my string.
For example, given ":----- pt-br:-", I want "pt-br".
Thanks
result = subject.gsub(/\A[\d_\W]+|[\d_\W]+\Z/, '')
will remove non-letters from the start and end of the string.
\A and \Z anchor the regex at the start/end of the string (^/$ would also match after/before a newline which is probably not what you want - but that might not matter in this case);
[\d_\W]+ matches one or more digits, the underscore or anything else that is not an alphanumeric character, leaving only letters.
| is the alternation operator.
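The pattern explained above can be tried on the question's string and on a couple of made-up inputs (the second and third strings are assumptions added for illustration).

```ruby
# Non-letters at the start, or non-letters at the end, of the string.
trim = /\A[\d_\W]+|[\d_\W]+\Z/

from_question = ":----- pt-br:-".gsub(trim, '')
with_digits   = "42_hello world!!\n".gsub(trim, '')  # inner space is kept
unchanged     = "plain".gsub(trim, '')

puts from_question, with_digits, unchanged
```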
In Ruby 1.9.1:
":----- pt-br:-".partition( /[a-zA-Z](...)[a-zA-Z]/ )[1]
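partition searches for the pattern in the string and returns the part before it, the match, and the part after it. Note that this particular pattern matches exactly five characters (a letter, any three characters, a letter), so it fits this example but not strings of other lengths.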
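A sketch of what partition returns, using a variation of the pattern with .* in the middle so the match is not limited to a fixed length. This variation is an assumption, not the answerer's pattern, and it still needs at least two letters on one line to match.

```ruby
# Variation (assumption): first letter, anything, last letter.
letters = /[a-zA-Z].*[a-zA-Z]/

before, match, after = ":----- pt-br:-".partition(letters)
longer = "## en-us-posix !!".partition(letters)[1]

# When nothing matches, partition returns the string and two empty strings.
no_match = "x".partition(letters)

puts before.inspect, match, after.inspect, longer, no_match.inspect
```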
result = subject.gsub(/^[^a-zA-Z]+/, '').gsub(/[^a-zA-Z]+$/, '')
