Ruby Regex Group Replacement - ruby

I am trying to perform regular expression matching and replacement on the same line in Ruby. I have some libraries that manipulate strings in Ruby and add special formatting characters to it. The formatting can be applied in any order. However, if I would like to change the string formatting, I want to keep some of the original formatting. I'm using regex for that. I have the regular expression matching correctly what I need:
mystring.gsub(/[(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))|(\e\[[3,9][0-8]m)]*Text/, 'New Text')
However, what I really want is the matching from the first grouping found in:
(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))
to be appended to New Text and replaced as opposed to just New Text. I'm trying to reference the match in the form of
mystring.gsub(/[(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))|(\e\[[3,9][0-8]m)]*Text/, '\1' + 'New Text')
but my understanding is that \1 only works when using \d or \k. Is there any way to reference that specific capturing group in my replacement string? Additionally, since I am using an asterisk on the [], I know that this grouping could occur more than once. Therefore, I would like the last matching occurrence to be yielded.
My expected input/output with a sample is:
Input: "\e[1mHello there\e[34m\e[40mText\e[0m\e[0m\e[22m"
Output: "\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m"
Input: "\e[1mHello there\e[44m\e[34m\e[40mText\e[0m\e[0m\e[22m"
Output: "\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m"
So the last grouping is found and appended.

You can use the following regex with back-reference \\1 in the replacement:
reg = /(\\e\[(?:[0-9]{1,2}|[3,9][0-8])m)+Text/
mystring = "\\e[1mHello there\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')
mystring = "\\e[1mHello there\\e[44m\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')
Output of the IDEONE demo:
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
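Mind that your input has a backslash \ that needs escaping in a regular string literal. To match it inside the regex, we use a double backslash, as we are looking for a literal backslash. Because the capturing group is quantified with +, \1 holds the text of its last repetition, which is the last matching occurrence you asked for.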
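If the string holds real escape characters rather than a backslash followed by the letter e, the same idea can be sketched as below. This is only a sketch under assumptions: the input is built with a double-quoted literal so that "\e" is the ESC character, the pattern is simplified to any one- or two-digit code, and the variable names are illustrative.

```ruby
# Assumption: "\e" in a double-quoted literal is a real ESC character (0x1B).
input = "\e[1mHello there\e[44m\e[34m\e[40mText\e[0m\e[0m\e[22m"

# Simplified pattern: one or more escape sequences with a 1-2 digit code,
# followed by the literal word "Text". Group 1 is repeated by +, so after
# the match it holds only its final repetition.
pattern = /(\e\[\d{1,2}m)+Text/

# Single quotes keep \1 as a backreference to group 1.
result = input.gsub(pattern, '\1New Text')

puts result.inspect
```

Only the escape sequence closest to Text survives, because a repeated group keeps its last iteration.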

Related

Ruby regex | Match enclosing brackets

I'm trying to create a regex pattern to match particular sets of text in my string.
Let's assume this is the string ^foo{bar}#Something_Else
I would like to match ^foo{} skipping entirely the content of the brackets.
So far I have figured out how to match everything with this regex: \^(\w)\{([^\}]+)} but I really don't know how to ignore the text inside the curly brackets.
Does anyone have an idea? Thanks.
Update
This is the final solution:
puts script.gsub(/(\^\w+)\{([^}]+)(})/, '[BEFORE]\2[AFTER]')
Though I'd prefer this with fewer groups:
puts script.gsub(/\^\w+\{([^}]+)}/, '[BEFORE]\1[AFTER]')
Original answer
I need to replace the ^foo{} part with something else
Here is a way to do it with gsub:
s = "^foo{bar}#Something_Else"
puts s.gsub(/(.*)\^\w+\{([^}]+)}(.*)/, '\1SOMETHING ELSE\2\3')
See demo
The technique is the same: you capture the text you want to keep and just match text you want to delete, and use backreferences to restore the text you captured.
The regex matches:
(.*) - matches and captures into Group 1 as much text as possible from the start
\^\w+\{ - matches ^, 1 or more word characters, {
([^}]+) - matches and captures into Group 2 1 or more symbols other than }
} - matches the }
(.*) - and finally match and capture into Group 3 the rest of the string.
If you mean to match ^foo{} by a single match against a regex, it is impossible. A regex match only matches a substring of the original string. Since ^foo{} is not a substring of ^foo{bar}#Something_Else, you cannot match that with a single match.
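Although a single match cannot return ^foo{}, capture groups can get to the same text in one step. The following is a minimal sketch, assuming the input from the question; the variable names emptied and pieces are made up for illustration.

```ruby
s = "^foo{bar}#Something_Else"

# Capture the opening "^foo{" and the closing "}", and match (without
# capturing) whatever sits between them so that it is dropped.
emptied = s.sub(/(\^\w+\{)[^}]*(\})/, '\1\2')

# Alternatively, leave the string alone and join the two captured pieces.
m = s.match(/(\^\w+\{)[^}]*(\})/)
pieces = m[1] + m[2]

puts emptied
puts pieces
```

The first form rewrites the string, the second only extracts the two captured parts.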

Regex matching chars around text

I have a string with chars inside and I would like to match only the chars around a string.
"This is a [1]test[/1] string. And [2]test[/2]"
Rubular http://rubular.com/r/f2Xwe3zPzo
Currently, the code in the link matches the text inside the special chars, how can I change it?
Update
To clarify my question: it should only match if the opening and closing tags have the same number.
"[2]first[/2] [1]second[/2]"
In the code above, only first should match and not second. The text inside the special chars (first), should be ignored.
Try this:
(\[[0-9]\]).+?(\[\/[0-9]\])
Permalink to the example on Rubular.
Update
Since you want to remove the 'special' characters, try this instead:
foo = "This is a [1]test[/1] string. And [2]test[/2]"
foo.gsub /\[\/?\d\]/, ""
# => "This is a test string. And test"
Update, Part II
You only want to remove the 'special' characters when the surrounding tags match, so what about this:
foo = "This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]"
foo.gsub /(?:\[(?<number>\d)\])(?<content>.+?)(?:\[\/\k<number>\])/, '\k<content>'
# => "This is a test string. And test, but not [3]test[/2]"
\[([0-9])\].+?\[\/\1\]
([0-9]) is a capture since it is surrounded with parentheses. The \1 tells it to use the result of that capture. If you had more than one capture, you could reference them as well, \2, \3, etc.
You can also use a named capture, rather than \1 to make it a little less cryptic. As in: \[(?<number>[0-9])\].+?\[\/\k<number>\]
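As a rough illustration of the named form, the sketch below runs it against the mismatched-tag sample from the question. The group name content is an addition for the example and is not part of the pattern quoted above.

```ruby
# Named groups: "number" is referenced again with \k<number>, so the closing
# tag must carry the same digit as the opening tag.
pattern = /\[(?<number>[0-9])\](?<content>.+?)\[\/\k<number>\]/

str = "[2]first[/2] [1]second[/2]"

m = str.match(pattern)
matched_number  = m[:number]
matched_content = m[:content]

# scan returns one array of captures per match.
all_pairs = str.scan(pattern)

p matched_number, matched_content, all_pairs
```

Only the [2]...[/2] pair is reported; [1]second[/2] is skipped because the numbers differ.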
Here's a way to do it that uses the form of String#gsub that takes a block. The idea is to pull strings such as "[1]test[/1]" into the block, and there remove the unwanted bits.
str = "This is a [1]test[/1] string. And [2]test[/2], plus [3]test[/99]"
r = /
\[ # match a left bracket
(\d+) # capture one or more digits in capture group 1
\] # match a right bracket
.+? # match one or more characters lazily
\[\/ # match a left bracket and forward slash
\1 # match the contents of capture group 1
\] # match a right bracket
/x
str.gsub(r) { |s| s[/(?<=\]).*?(?=\[)/] }
#=> "This is a test string. And test, plus [3]test[/99]"
Aside: When I first heard of named capture groups, they seemed like a great idea, but now I wonder if they really make regexes easier to read than \1, \2....

String gsub - Replace characters between two elements, but leave surrounding elements

Suppose I have the following string:
mystring = "start/abc123/end"
How can you splice out the abc123 with something else, while leaving the "start/" and "/end" elements intact?
I had the following to match for the pattern, but it replaces the entire string. I was hoping to just have it replace the abc123 with 123abc.
mystring.gsub(/start\/(.*)\/end/,"123abc") #=> "123abc"
Edit: The characters between the start & end elements can be any combination of alphanumeric characters, I changed my example to reflect this.
You can do it using this character class : [^\/] (all that is not a slash) and lookarounds
mystring.gsub(/(?<=start\/)[^\/]+(?=\/end)/,"7")
For your example, you could perhaps use:
mystring.gsub(/\/(.*?)\//,"/7/")
This matches the two slashes around the string you're replacing and puts them back in the substitution.
Alternatively, you could capture the pieces of the string you want to keep and interpolate them around your replacement, this turns out to be much more readable than lookaheads/lookbehinds:
irb(main):010:0> mystring.gsub(/(start)\/.*\/(end)/, "\\1/7/\\2")
=> "start/7/end"
\\1 and \\2 here refer to the numbered captures inside of your regular expression.
The problem is that you're replacing the entire matched string, "start/8/end", with "7". You need to include the matched characters you want to persist:
mystring.gsub(/start\/(.*)\/end/, "start/7/end")
Alternatively, just match the digits:
mystring.gsub(/\d+/, "7")
You can do this by grouping the start and end elements in the regular expression and then referring to these groups in the substitution string (named groups are referenced with \k<name>):
mystring.gsub(/(?<start>start\/).*(?<end>\/end)/, "\\k<start>7\\k<end>")
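To compare the approaches side by side on the sample from the question, here is a small sketch. The group names head and tail and the result variable names are hypothetical, chosen only for this example.

```ruby
mystring = "start/abc123/end"

# Lookarounds: only the middle part is matched, so only it is replaced.
with_lookarounds = mystring.gsub(/(?<=start\/)[^\/]+(?=\/end)/, "123abc")

# Named groups: the whole string is matched and the kept parts are restored
# with \k<name> in a single-quoted replacement string.
with_named = mystring.gsub(/(?<head>start\/).*(?<tail>\/end)/, '\k<head>123abc\k<tail>')

# Block form: $1 and $2 are interpolated, which avoids any ambiguity when
# the replacement text itself starts with a digit.
with_block = mystring.gsub(/(start\/).*(\/end)/) { "#{$1}123abc#{$2}" }

puts with_lookarounds, with_named, with_block
```

All three should produce the same string; the block form is the safest choice when the replacement begins with digits.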

Match consecutive list of exactly one character in set with regular expressions

I don't think I'll even try to explain this, I don't know the words to, but I'd like to achieve the following:
Given a string like this:
+++>><<<--
I'd like a match to give me: +++, but also match if any of the other characters were in the string consecutively like they are. So if the +++ wasn't there, I'd like to match >>.
I tried using the following regular expression:
([><\-\+]+)
However, given the string above, it would match the entire string, and not the first list of consecutive characters.
If it makes a difference, this is in Ruby (1.9.3).
Not sure about the ruby bit, but you can do this with backreferences in the pattern:
(.)\1+
This uses a capturing group () to capture any character (.), followed by one or more (+) repetitions of that same character (\1). The \1 is a backreference to the first captured group; in a pattern with more capturing groups \2 would be the second captured group and so on.
Java Example
Pattern p = Pattern.compile("(.)\\1+");
Matcher m = p.matcher("aaabbccaa");
m.find();
System.out.println(m.group(0)); // prints "aaa"
Ruby Example
# Return an array of matched patterns.
string = '+++>><<<--'
string.scan( /((.)\2+)/ ).collect { |match| match.first }
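If only the first run is wanted, String#[] with a regex returns just the first match. The sketch below restricts the pattern to the four characters from the question, which is an assumption about the intended input; note that \1+ requires a run of at least two characters, while \2* in the scan variant also accepts a single character.

```ruby
string = '+++>><<<--'

# First run of two or more identical characters from the set > < - +
first_run = string[/([><\-+])\1+/]

# The same pattern on a string without the leading plus signs.
without_plus = '>><<<--'[/([><\-+])\1+/]

# Every run, including runs of length one (\2* allows zero repetitions).
all_runs = string.scan(/(([><\-+])\2*)/).map(&:first)

p first_run, without_plus, all_runs
```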

How to remove the first 4 characters from a string if it matches a pattern in Ruby

I have the following string:
"h3. My Title Goes Here"
I basically want to remove the first four characters from the string so that I just get back:
"My Title Goes Here".
The thing is I am iterating over an array of strings and not all have the h3. part in front so I can't just ditch the first four characters blindly.
I checked the docs and the closest thing I could find was chomp, but that only works for the end of a string.
Right now I am doing this:
"h3. My Title Goes Here".reverse.chomp(" .3h").reverse
This gives me my desired output, but there has to be a better way. I don't want to reverse a string twice for no reason. Is there another method that will work?
To alter the original string, use sub!, e.g.:
my_strings = [ "h3. My Title Goes Here", "No h3. at the start of this line" ]
my_strings.each { |s| s.sub!(/^h3\. /, '') }
To leave the original unaltered and only return the result, remove the exclamation point, i.e. use sub. In the general case you may have a regular expression that can match more than once and want every match replaced; in that case use gsub! and gsub. Without the g only the first match is replaced (which is what you want here; in a single-line string the ^ can only match once, at the start).
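The difference between the variants can be checked with a short sketch like the one below. The sample strings and variable names are made up for illustration; .dup is used only so the in-place calls work on mutable copies.

```ruby
original = "h3. My Title Goes Here"

# sub returns a new string and leaves the receiver untouched.
copy = original.sub(/^h3\. /, '')

# sub! modifies the receiver and returns nil when nothing was replaced.
changed = "h3. in place".dup
changed_return = changed.sub!(/^h3\. /, '')

unchanged = "No h3. at the start".dup
unchanged_return = unchanged.sub!(/^h3\. /, '')

# gsub replaces every match, sub only the first one.
first_only = "h3. h3. twice".sub(/h3\. /, '')
every_one  = "h3. h3. twice".gsub(/h3\. /, '')

p copy, original, changed, unchanged_return, first_only, every_one
```

Because sub! can return nil, it is usually safer not to chain further calls onto its return value.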
You can use sub with a regular expression:
s = 'h3. foo'
s.sub!(/^h[0-9]+\. /, '')
puts s
Output:
foo
The regular expression should be understood as follows:
^ Match at the start of a line (for a single-line string, the start of the string).
h A literal "h".
[0-9] A digit from 0-9.
+ One or more of the previous (i.e. one or more digits)
\. A literal period.
A space (yes, spaces are significant by default in regular expressions!)
You can modify the regular expression to suit your needs. See a regular expression tutorial or syntax guide for details.
A standard approach would be to use regular expressions:
"h3. My Title Goes Here".gsub /^h3\. /, '' #=> "My Title Goes Here"
gsub means globally substitute and it replaces a pattern by a string, in this case an empty string.
The regular expression is enclosed in / and consists of:
^ means the beginning of a line (the beginning of the string when there is only one line)
h3 is matched literally, so it means h3
\. - a dot normally means any character so we escape it with a backslash
the trailing space is matched literally
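One caveat worth checking: in Ruby, ^ anchors at the start of any line, while \A anchors only at the start of the whole string. The sketch below uses a made-up two-line string to show the difference; the variable names are illustrative.

```ruby
multi = "intro line\nh3. My Title Goes Here"

# ^ matches after the newline, so the prefix on the second line is removed.
line_anchored = multi.sub(/^h3\. /, '')

# \A matches only at the very start of the string, so nothing is removed.
string_anchored = multi.sub(/\Ah3\. /, '')

p line_anchored, string_anchored
```

For single-line titles like the ones in the question both anchors behave the same.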
