Ruby Regexp gsub, replace instances of second matching character - ruby

I would like to replace the first letter after a hyphen in a string with a capitalised letter.
"this-is-a-string" should become "thisIsAString"
"this-is-a-string".gsub( /[-]\w/, '\1'.upcase )
I was hoping that \1 would reinsert my second character match \w and that I could capitalise it.
How does one use the \0 \1 etc options?

You need to capture \w to be able to refer to the submatch.
Use
"this-is-a-string".gsub(/-(\w)/) {$~[1].upcase}
# => thisIsAString
See the Ruby demo
Note that $~[1] inside the {$~[1].upcase} block is actually the text captured with (\w), the $~ is a matchdata object instantiated with gsub and [1] is the index of the first group defined with a pair of unescaped parentheses.
See more details about capturing groups in the Use Parentheses for Grouping and Capturing section at regular-expressions.info.

Related

ruby sub when replacement string starts with \0

In ruby, sub does not allow to replace a string by another one starting with '\0'.
'a'.sub('a','\\0b')
Returns:
'ab'
The doc says that \0 is interpreted as a backreference, but as the first parameter is not a Regexp, I don't understand why it works like that.
If you want your second argument to be interpreted as a plain String you can escape it like:
'a'.sub('a', Regexp.escape('\0b'))
or
'a'.sub('a', '\\\0b')
both returns:
"\\0b"
Explanation about this behaviour can be found in documentation
sub(pattern, replacement) → new_str
The pattern is typically a Regexp; if given as a String, any regular
expression metacharacters it contains will be interpreted literally,
e.g. '\d' will match a backslash followed by 'd', instead of a digit.
If replacement is a String it will be substituted for the matched
text. It may contain back-references to the pattern's capture groups
of the form "\d", where d is a group number, or "\k<n>", where n is a
group name. If it is a double-quoted string, both back-references must
be preceded by an additional backslash. However, within replacement
the special match variables, such as $&, will not refer to the current
match. If replacement is a String that looks like a pattern's capture
group but is actually not a pattern capture group e.g. "\'", then it
will have to be preceded by two backslashes like so "\'".

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) Negative lookbehind asserts that the character or substring we are going to match would be preceeded by any but not of a non-space character. So this matches the # which exists at the start or the # which exists next to a space character.
Difference between my answer and hwnd's str.gsub(/\B#/, '') is, mine won't match the # which exists in :# but hwnd's answer does. \B matches between two word characters or two non-word characters.
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

Ruby regex - gsub only captured group

I'm not quite sure I understand how non-capturing groups work. I am looking for a regex to produce this result: 5.214. I thought the regex below would work, but it is replacing everything including the non-capture groups. How can I write a regex to only replace the capture groups?
"5,214".gsub(/(?:\d)(,)(?:\d)/, '.')
# => ".14"
My desired result:
"5,214".gsub(some_regex)
#=> "5.214
non capturing groups still consumes the match
use
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
or
"5,214".gsub(/(?<=\d+)(,)(?=\d+)/, '.')
You can't. gsub replaces the entire match; it does not do anything with the captured groups. It will not make any difference whether the groups are captured or not.
In order to achieve the result, you need to use lookbehind and lookahead.
"5,214".gsub(/(?<=\d),(?=\d)/, '.')
It is also possible to use Regexp.last_match (also available via $~) in the block version to get access to the full MatchData:
"5,214".gsub(/(\d),(\d)/) { |_|
match = Regexp.last_match
"#{match[1]}.#{match[2]}"
}
This scales better to more involved use-cases.
Nota bene, from the Ruby docs:
the ::last_match is local to the thread and method scope of the method that did the pattern match.
gsub replaces the entire match the regular expression engine produces. Both capturing/non-capturing group constructs are not retained. However, you could use lookaround assertions which do not "consume" any characters on the string.
"5,214".gsub(/\d\K,(?=\d)/, '.')
Explanation: The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. That being said, we then look for and match the comma, and the Positive Lookahead asserts that a digit follows.
I know nothing about ruby.
But from what i see in the tutorial
gsub mean replace,
the pattern should be /(?<=\d+),(?=\d+)/ just replace the comma with dot
or, use capture /(\d+),(\d+)/ replace the string with "\1.\2"?
You can easily reference capture groups in the replacement string (second argument) like so:
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
#=> "5.214"
\0 will return the whole matched string.
\1 will be replaced by the first capturing group.
\2 will be replaced by the second capturing group etc.
You could rewrite the example above using a non-capturing group for the , char.
"5,214".gsub(/(\d+)(?:,)(\d+)/, '\1.\2')
#=> "5.214"
As you can see, the part after the comma is now the second capturing group, since we defined the middle group as non-capturing.
Although it's kind of pointless in this case. You can just omit the capturing group for , altogether
"5,214".gsub(/(\d+),(\d+)/, '\1.\2')
#=> "5.214"
You don't need regexp to achieve what you need:
'1,200.00'.tr('.','!').tr(',','.').tr('!', ',')
Periods become bangs (1,200!00)
Commas become periods (1.200!00)
Bangs become commas (1.200,00)

preg_match search pattern, stop at character combination

I am trying to pull a whole Mysql statement from a database sql file
INSERT INTO `helppages`
(`HelpPageID`, `ShowHelpItem`, `HelpRank`, `HelpCategory`, `HelpTitle`, `HelpDescription`, `HelpLink`, `HelpText`, `CMSHelpBar`, `CMSHelpBarAdditional`)
VALUES (... characters (Too many to post here, but the expression below grabs all) ...
);
The current, though I have been through many variations, expression I am using is:
preg_match("#INSERT INTO `$SearchingTableName` ([!%&'-/:<=>#^`\;\s\d\w\"\#\$\(\)\*\+\,\.\?\[\]\{\}\(\)\\\|©]*?)\)\;\r\n#s", $uploadedfile, $matches);
which gets all the information but I can't get it to stop at the end ");\r\n"
also $SearchingTableName = helppages.
Edit
Sorry the current expression uses look forward
preg_match("#INSERT INTO `$SearchingTableName` ([!%&'-/:<=>#^`\;\s\d\w\"\#\$\(\)\*\+\,\.\?\[\]\{\}\(\)\\\|©]*)(?!\)\;\r\n)#s", $uploadedfile, $matches);
Also I checked with MSword using );^p and there is only one instance at the end of the Insert
To match this kind of string you can't do it only playing with character classes. You need to describe the string structure.
For this simple particular case you can use this pattern:
$pattern = <<<EOD
~
# definitions
(?(DEFINE)
(?<elt> [^"',)]+ | '(?>[^\\']+|\\.)*' | "(?>[^\\"]+|\\.)*" )
(?<list> \( \g<elt>? (?: \s* , \s* \g<elt> )* \) )
)
# main pattern
INSERT \s+ (?:INTO \s+)? `$SearchingTableName` \s* \g<list>? \s* VALUES \s*
\g<list> \s* (?: , \s* \g<list> \s* )* ;
~xs
EOD;
if (preg_match_all($pattern, $uploadedfile, $m))
print_r($m[0]);
online demo
But keep in mind that parsing a programming language is not an easy task and is full of traps (depending of the syntax) even for the capabilities of the PHP regex engine. (It's however possible.)
regex features used here:
delimiters and modifiers:
The pattern delimiter used here is ~ instead of the classical /. There is no literal ~ in the pattern thus it's ok.
The pattern uses two modifiers: s and x:
by default the . can't match the newline character \n. The s modifier (s for singleline mode) changes this behavior. When used the . can match all characters including the newline character. (Note that you can retrieve this default behavior using \N that doesn't match the newline character whatever the mode.)
x switches on the extended mode. In this mode, whitespaces inside the pattern are ignored. This mode allows too inline comments that begin with a sharp character #. This mode is very useful to make readable long patterns using spaces, indentation and comments.
using named captures
When you have a long pattern and when you need to reuse several times the same subpatterns, you have the possibility to reuse subpatterns that are written inside capture groups.
A quick example:
You want to match several items separated by commas and composed with 4 digits and 4 letters like this: 1234abcd,5678efgh,9012ijkl,3456mnop.
The pattern to do that is obviously ^\d{4}[a-z]{4}(?:,\d{4}[a-z]{4})+$
But if I don't want to write \d{4}[a-z]{4} two times, I can put it in a capture group and use an alias for the subpattern in the capture group, like this: ^(\d{4}[a-z]{4})(?:,(?1))+$.
Here the (?1) is an alias for the subpattern inside the capture group 1 (not the content matched by the subpattern as a backreference \1 does, but the subpattern itself) that is \d{4}[a-z]{4}.
PCRE, the regex engine used by PHP supports this syntax too \g<1> instead of (?1).
But if you have a lot of capture groups in the pattern, it is not always handy to remember what's the number of the capture group you need. This is the reason why you have the possibility to name capturing groups. Example: ^(?<diglet>\d{4}[a-z]{4})(?:,\g<diglet>)+$
The other advantage of named patterns, except to make the whole pattern more readable, is to add a semantical dimension to the pattern, in the same way you can do it by addying an id attribute to an html tag.
definition section
Instead of defining the named subpattern directly in the main pattern like in the previous example, you can use a definition section to put all the subpatterns that would be used in the main pattern. Note that all that is inside this section is only here for definition purpose and doesn't match nothing. It's like a zero-width assertion.
The syntax of this section is : (?(DEFINE)(?<diglet>\d{4}[a-z]{4})) (you can put several named subpatterns inside.). The precedant pattern becomes:(?(DEFINE)(?<diglet>\d{4}[a-z]{4}))^\g<diglet>(?:,\g<diglet>)+$
the pattern itself:
The first part of the pattern enclosed between (?(DEFINE) and ) consists of subpatterns definitions that will be used later in the main pattern.
The elt subpattern describes an item (a column name or a value):
[^"',)]+ # all that is not a quote a comma or a closing parenthese:
# in the present context this will match numbers and column names
| # OR
'(?>[^\\']+|\\.)*' # string between single quotes (designed to deal with escaped quotes)
|
"(?>[^\\"]+|\\.)*" # same for double quotes
The list subpattern describes the full list of elements separated by commas between parenthesis. Note that this subpattern use a reference to the elt subpattern.
The main pattern needs only to reuse the subpattern list.

What does <\1> mean in String#sub?

I was reading this and I did not understand it. I have two questions.
What is the difference ([aeiou]) and [aeiou]?
What does <\1> mean?
"hello".sub(/([aeiou])/, '<\1>') #=> "h<e>llo"
Here it documented:
If replacement is a String it will be substituted for the matched text. It may contain back-references to the pattern’s capture groups of the form "\d", where d is a group number, or "\k<n>", where n is a group name. If it is a double-quoted string, both back-references must be preceded by an additional backslash. However, within replacement the special match variables, such as &$, will not refer to the current match.
Character Classes
A character class is delimited with square brackets ([, ]) and lists characters that may appear at that point in the match. /[ab]/ means a or b, as opposed to /ab/ which means a followed by b.
Hope above definition made clear what [aeiou] is.
Capturing
Parentheses can be used for capturing. The text enclosed by the nth group of parentheses can be subsequently referred to with n. Within a pattern use the backreference \n; outside of the pattern use MatchData[n].
Hope above definition made clear what ([aeiou]) is.
([aeiou]) - any characters inside the character class [..],which will be found first from the string "hello",is the value of \1(i.e.the first capture group). In this example value of \1 is e,which will be replaced by <e> (as you defined <\1>). That's how "h<e>llo" has been generated from the string hello using String#sub method.
The doc you post says
It may contain back-references to the pattern’s capture groups of the
form "\d", where d is a group number, or "\k", where n is a group
name.
So \1 matches whatever was captured in the first () group, i.e. one of [aeiou] and then uses it in the replacement <\1>

Resources