Find index of regex match in string - ruby

Is it possible to find the index of a match in regex while still getting the match? For example:
str = "foo [bar] hello [world]"
str.match(/\[(.*?)\]/) { |match,idx|
puts match
puts idx
}
Unfortunately, idx is nil in this example.
My real-world problem is a string in which I want to replace certain bracketed substrings with parenthesized ones, based on some condition (e.g. whether the substring is in a blacklist). For example, "foo [bar] hello [world]" should become "foo [bar] hello (world)" when the word world is in the blacklist.

You can use String#gsub:
blacklist = ["world"]
str = "foo [bar] hello [world]"
str.gsub(/\[(\w*?)\]/) { |m|
blacklist.include?($1) ? "(#{$1})" : m
}
#=> "foo [bar] hello (world)"
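If the index of each match is needed as well, the gsub block can read it from Regexp.last_match. The following is a minimal sketch rather than a definitive solution; the positions array is a name introduced here for illustration only:

```ruby
blacklist = ["world"]
str = "foo [bar] hello [world]"
positions = [] # hypothetical helper: collects [captured text, start index] pairs

result = str.gsub(/\[(.*?)\]/) do |m|
  match = Regexp.last_match          # MatchData for the current match
  positions << [match[1], match.begin(0)]
  blacklist.include?(match[1]) ? "(#{match[1]})" : m
end

p result    # "foo [bar] hello (world)"
p positions # [["bar", 4], ["world", 16]]
```

The offsets refer to the original string, not to the string being built, so they stay valid even when replacements change the length.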

If you want an Enumerator with every match object, you can use:
def matches(string, regex)
position = 0
Enumerator.new do |yielder|
while match = regex.match(string, position)
yielder << match
position = match.end(0)
end
end
end
As an example:
p matches("foo [bar] hello [world]", /\[(.*?)\]/).to_a
# [#<MatchData "[bar]" 1:"bar">, #<MatchData "[world]" 1:"world">]
p matches("foo [bar] hello [world]", /\[(.*?)\]/).map{|m| [m[1], m.begin(0)]}
# [["bar", 4], ["world", 16]]
You can get the matched string and its index from the match object.
But it looks like what you actually need is gsub with a block:
"foo [bar] hello [world]".gsub(/\[(.*?)\]/) { |m|
# define logic here; the block's return value replaces the match
}

Related

Replace matched lines in a file but ignore commented-out lines using Ruby

How can I replace text in a file in Ruby without touching commented-out lines? To be more specific, I want to change a variable in a configuration file. An example would be:
irb(main):014:0> string = "#replaceme\n\t\s\t\s# replaceme\nreplaceme\n"
=> "#replaceme\n\t \t # replaceme\nreplaceme\n"
irb(main):015:0> puts string.gsub(%r{replaceme}, 'replaced')
#replaced
# replaced
replaced
=> nil
irb(main):016:0>
Desired output:
#replaceme
# replaceme
replaced
I don't fully understand the question. To do a find and replace in each line, disregarding text following a pound sign, one could do the following.
def replace_em(str, source, replacement)
str.split(/(\#.*?$)/).
map { |s| s[0] == '#' ? s : s.gsub(source, replacement) }.
join
end
str = "It was known that # that dog has fleas, \nbut who'd know that that dog # wouldn't?"
replace_em(str, "that", "the")
#=> "It was known the # that dog has fleas, \nbut who'd know the the dog # wouldn't?"
str = "#replaceme\n\t\s\t\s# replaceme\nreplaceme\n"
replace_em(str, "replaceme", "replaced")
#=> "#replaceme\n\t \t # replaceme\nreplaced\n"
For the string
str = "It was known that # that dog has fleas, \nbut who'd know that that dog # wouldn't?"
source = "that"
replacement = "the"
the steps are as follows.
a = str.split(/(\#.*?$)/)
#=> ["It was known that ", "# that dog has fleas, ",
# "\nbut who'd know that that dog ", "# wouldn't?"]
Note that the body of the regular expression must be put in a capture group in order that the text used to split the string be included as elements in the resulting array. See String#split.
b = a.map { |s| s[0] == '#' ? s : s.gsub(source, replacement) }
#=> ["It was known the ", "# that dog has fleas, ",
# "\nbut who'd know the the dog ", "# wouldn't?"]
b.join
#=> "It was known the # that dog has fleas, \nbut who'd know the the dog # wouldn't?"
How about this?
puts string.gsub(%r{^replaceme}, 'replaced')
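The anchored pattern above only helps when the text to replace sits at the very start of a line. A line-by-line variant, sketched here under the assumption that a comment is any line whose first non-whitespace character is #, could look like this (replace_outside_comments is a hypothetical name, not something from the answers above):

```ruby
# Replace `source` with `replacement` on every line that is not a comment.
# Assumption: a comment is a line whose first non-blank character is '#'.
def replace_outside_comments(text, source, replacement)
  text.each_line.map do |line|
    line.lstrip.start_with?("#") ? line : line.gsub(source, replacement)
  end.join
end

config = "#replaceme\n\t \t # replaceme\nreplaceme\n"
updated = replace_outside_comments(config, "replaceme", "replaced")
p updated # "#replaceme\n\t \t # replaceme\nreplaced\n"
```

Unlike the split-based answer, this sketch does not protect a trailing comment that follows code on the same line.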

Replace characters from string without changing its object_id in Ruby

How can I replace characters from string without changing its object_id?
For example:
string = "this is a test"
The first 7 characters need to be replaced with capitalized characters like: "THIS IS a Test" and the object_id needs to be the same. In which way can I sub or replace the characters to make it happen?
You can do it like this:
string = "this is a test"
string[0, 7] = string[0, 7].upcase
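To confirm that the slice assignment mutates the existing object, the object_id can be compared before and after. This is only a small check; String.new is used here on the assumption that the literal might otherwise be frozen (for example under a frozen_string_literal magic comment):

```ruby
string = String.new("this is a test") # explicitly unfrozen copy
id_before = string.object_id

string[0, 7] = string[0, 7].upcase    # replaces characters 0..6 in place

p string                              # "THIS IS a test"
p string.object_id == id_before       # true
```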
With procedural languages, one might write the equivalent of:
string = "this is in jest"
string.object_id
#=> 70309969974760
(1..7).each { |i| string[i] = string[i].upcase }
#=> 1..7
string
#=> "tHIS IS in jest"
string.object_id
#=> 70309969974760
This is not very Ruby-like, but it does have one advantage over @sawa's solution: it does not create a temporary 7-character string. (Well, it does create one-character strings.) This is unimportant for strings of reasonable length (and for those I'd certainly concur with sawa), but it could be significant for really, really, really long strings.
Another way to do this is as follows:
string.each_char.with_index { |c,i|
string[i] = string[i].upcase if (1..7).cover?(i) }
#=> "tHIS IS in jest"
string.object_id
#=> 70309969974760
This second way might be more efficient if string is not much larger than string[start_index..end_index].
Edit:
In a comment, the OP indicates that the string is to be stripped, squeezed, and reversed, as well as having certain characters converted to upper case. That could be done on the string in place, without creating a copy, as follows:
def strip_upcase_squeeze_reverse_whew(string, upcase_range, squeeze_str=nil)
string.strip!
upcase_range.each { |i| string[i] = string[i].upcase }
squeeze_str.nil? ? string.squeeze! : string.squeeze!(squeeze_str)
string.reverse!
end
I have assumed the four operations would be performed in a particular order, but if the order should be different, that's an easy fix.
string = " this may bee inn jest, butt it's alsoo a test "
string.object_id
#=> 70309970103280
strip_upcase_squeeze_reverse_whew(string, (1..7))
#=> "tset a osla s'ti tub ,tsej ni eb YAM SIHt"
string.object_id
#=> 70309970103280
The steps:
string = "this may bee inn jest, butt it's alsoo a test"
#=> "this may bee inn jest, butt it's alsoo a test"
upcase_range = (1..7)
#=> 1..7
squeeze_str = nil
#=> nil
string.strip!
#=> nil
string
#=> "this may bee inn jest, butt it's alsoo a test"
upcase_range.each { |i| string[i] = string[i].upcase }
#=> 1..7
string
#=> "tHIS MAY bee inn jest, butt it's alsoo a test"
squeeze_str.nil? ? string.squeeze! : string.squeeze!(squeeze_str)
#=> "tHIS MAY be in jest, but it's also a test"
string
#=> "tHIS MAY be in jest, but it's also a test"
string.reverse!
#=> "tset a osla s'ti tub ,tsej ni eb YAM SIHt"
Notice that in this example, strip! does not remove any characters, and therefore returns nil. Similarly, squeeze! would return nil if there were nothing to squeeze. It is for that reason that strip! and squeeze! cannot be chained.
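The chaining problem can be reproduced directly. The sketch below uses a made-up sample string with no surrounding whitespace, so strip! returns nil and the chained call fails; calling the bang methods one after another avoids that:

```ruby
string = String.new("no padding, but dooubled letters")

chained_failed = false
begin
  string.strip!.squeeze!   # strip! returns nil here, so nil.squeeze! is attempted
rescue NoMethodError
  chained_failed = true
end
p chained_failed           # true

# Sequential calls ignore the nil return values and still mutate in place.
string.strip!
string.squeeze!
p string                   # "no pading, but doubled leters"
```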
A second example:
string = " thiiiis may beeee in jeeest"
strip_upcase_squeeze_reverse_whew(string, (12..14), "aeiouAEIOU")
Adding onto a string without changing its object id:
foo = "foo"
# => "foo"
foo.object_id
# => 70196045363960
foo << "bar"
# => "foobar"
foo.object_id
# => 70196045363960
Replace an entire string without changing its object id
foo
# => "foo"
foo.object_id
# => 70196045363960
foo.gsub!(/./, '') << 'bar'
# => 'bar'
foo.object_id
# => 70196045363960
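Ruby also provides String#replace, which swaps the whole content of a string in place and reads more directly than emptying it with gsub! first. A minimal sketch, assuming the string is not frozen:

```ruby
foo = String.new("foo")       # explicitly unfrozen
id_before = foo.object_id

foo.replace("bar")            # overwrite the content, keep the object

p foo                         # "bar"
p foo.object_id == id_before  # true
```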
Replace part of a string without changing its object id
foo
# => "foo"
foo.object_id
# => 70196045363960
foo.gsub!(/o/, 'z')
# => 'fzz'
foo.object_id
# => 70196045363960

Difference between << and =

Can anyone explain why foo is mutated in version 1? What is the difference between << and = assignment?
VERSION 1
foo = "apple"
bar = foo
"foo: #{foo}" # => foo: apple
bar << "hello"
"bar: #{bar}" # => bar: applehello
"foo: #{foo}" # => foo: applehello
VERSION2
foo = "apple"
bar = foo
"foo: #{foo}" # => foo: apple
bar = bar + "hello"
"bar: #{bar}" # => bar: applehello
"foo: #{foo}" # => foo: apple
Because = is an assignment, as you said.
But << is not an assignment: it is the concatenation operator when the left operand is a string.
So:
bar = bar + "hello"
creates a new string by joining contents of bar with "hello" and then this new string is assigned to variable bar, while:
bar << "hello"
concatenates in place: bar is not set to a new string; instead, the string it refers to is modified.
So with <<, bar and foo still refer to the same object, while with =, only bar gets a new value.
You're setting bar to refer to the same object as foo. The << operator works in place, as in the first version, whereas in the second version you're using +, which produces a new value without changing the original.
bar << "hello" appends to bar (which is the same object as foo), while bar = bar + "hello" creates a new string, so foo remains untouched.
String concatenation with + returns a new object:
http://www.ruby-doc.org/core-2.1.0/String.html#method-i-2B
The append operator acts on the object that the reference points to:
http://www.ruby-doc.org/core-2.1.0/String.html#method-i-3C-3C
In the first example you are appending to the object that both foo and bar point to.
In the second example you are adding "hello" to the object that bar points to. That returns a new object, which bar now points to, while foo still points to the original object, whose value is still just "apple".
First, observe the following:
str = "test"
#=> "test"
str[1]
#=> "e"
str1 = str
#=> "test"
str.object_id
#=> 8509820
str1.object_id
#=> 8509820
Here str and str1 refer to the same String object, which is why their object_ids match. Ruby strings are mutable, so << appends the characters of its argument to that existing object rather than building a new one.
str << "string"
#=> "teststring"
str1
#=> "teststring"
str.object_id
#=> 8509820
str1.object_id
#=> 8509820
No new object is created here: the same String object now holds the characters of the second string as well, and both variables see the change.
Now observe the following:
str = "test"
#=> "test"
str1 = str
str.object_id
#=> 9812345
str1.object_id
#=> 9812345
str = str + "string"
#=> "teststring"
str.object_id
#=> 9901234
str1
#=> "test"
str1.object_id
#=> 9812345
Here we see that the + operator creates a new object.
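When the aliasing in version 1 is not wanted, one option is to copy the string with dup before appending. The sketch below uses equal? to check object identity; the variable name copy is introduced here only for illustration:

```ruby
foo = String.new("apple")
bar = foo        # second reference to the same object
copy = foo.dup   # independent object with the same content

bar << "hello"   # mutates the object shared by foo and bar

p foo              # "applehello"
p copy             # "apple"
p bar.equal?(foo)  # true  (same object)
p copy.equal?(foo) # false (different object)
```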

How do you strip substrings in ruby?

I'd like to replace/duplicate a substring between two delimiters, e.g.:
"This is (the string) I want to replace"
I'd like to strip out everything between the characters ( and ), and set that substr to a variable -- is there a built in function to do this?
I would just do:
my_string = "This is (the string) I want to replace"
p my_string.split(/[()]/) #=> ["This is ", "the string", " I want to replace"]
p my_string.split(/[()]/)[1] #=> "the string"
Here are two more ways to do it:
/\((?<inside_parenthesis>.*?)\)/ =~ my_string
p inside_parenthesis #=> "the string"
my_new_var = my_string[/\((.*?)\)/,1]
p my_new_var #=> "the string"
Edit - Examples to explain the last method:
my_string = 'hello there'
capture = /h(e)(ll)o/
p my_string[capture] #=> "hello"
p my_string[capture, 1] #=> "e"
p my_string[capture, 2] #=> "ll"
var = "This is (the string) I want to replace"[/(?<=\()[^)]*(?=\))/]
var # => "the string"
str = "This is (the string) I want to replace"
str.match(/\((.*)\)/)
some_var = $1 # => "the string"
As I understand it, you want to remove or replace a substring as well as set a variable equal to that substring (sans the parentheses). There are many ways to do this, some of which are slight variants of the other answers. Here's another way that also allows for the possibility of multiple substrings within parentheses, picking up from @sawa's comments:
def doit(str, repl)
vars = []
[str.gsub(/\(.*?\)/) { |m| vars << m[1..-2]; repl }, vars]
end
new_str, vars = doit("This is (the string) I want to replace", '')
new_str # => "This is  I want to replace"
vars # => ["the string"]
new_str, vars = doit("This is (the string) I (really) want (to replace)", '')
new_str # => "This is  I  want "
vars # => ["the string", "really", "to replace"]
new_str, vars = doit("This (short) string is a () keeper", "hot dang")
new_str # => "This hot dang string is a hot dang keeper"
vars # => ["short", ""]
In the regex, the ? in .*? makes .* "lazy", so each match stops at the first closing parenthesis. gsub passes each match m to the block; the block strips the parentheses from m, appends the result to vars, and returns the replacement string. This regex also works:
/\([^)]*\)/
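If only the captured substrings are needed, String#scan with a capture group is another option. This is a sketch, assuming parentheses are never nested; the \s* in the second pattern is an addition made here so that removing a group does not leave a doubled space behind:

```ruby
str = "This is (the string) I (really) want (to replace)"

# scan returns one array per match when the regex has a capture group.
inside = str.scan(/\(([^)]*)\)/).flatten
p inside   # ["the string", "really", "to replace"]

# Remove each parenthesized group together with the whitespace before it.
stripped = str.gsub(/\s*\([^)]*\)/, "")
p stripped # "This is I want"
```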

How do I get the match data for all occurrences of a Ruby regular expression in a string?

I need the MatchData for each occurrence of a regular expression in a string. This is different than the scan method suggested in Match All Occurrences of a Regex, since that only gives me an array of strings (I need the full MatchData, to get begin and end information, etc).
input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/
numbers.match input # #<MatchData "12"> (only the first match)
input.scan numbers # ["12", "34", "567"] (all matches, but only the strings)
I suspect there is some method that I've overlooked. Suggestions?
You want
"abc12def34ghijklmno567pqrs".to_enum(:scan, /\d+/).map { Regexp.last_match }
which gives you
[#<MatchData "12">, #<MatchData "34">, #<MatchData "567">]
The "trick" is, as you see, to build an enumerator in order to get each last_match.
My current solution is to add an each_match method to Regexp:
class Regexp
def each_match(str)
start = 0
while matchdata = self.match(str, start)
yield matchdata
start = matchdata.end(0)
end
end
end
Now I can do:
numbers.each_match input do |match|
puts "Found #{match[0]} at #{match.begin(0)} until #{match.end(0)}"
end
Tell me there is a better way.
I’ll put it here to make the code available via a search:
input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/
input.gsub(numbers) { |m| p $~ }
The result is as requested:
⇒ #<MatchData "12">
⇒ #<MatchData "34">
⇒ #<MatchData "567">
See "Matching data in Ruby for all occurrences in a string" for more information.
I'm surprised nobody mentioned the amazing StringScanner class included in Ruby's standard library:
require 'strscan'
s = StringScanner.new('abc12def34ghijklmno567pqrs')
while s.skip_until(/\d+/)
num, offset = s.matched.to_i, [s.pos - s.matched_size, s.pos - 1]
# ..
end
No, it doesn't give you the MatchData objects, but it does give you an index-based interface into the string.
input = "abc12def34ghijklmno567pqrs"
n = Regexp.new("\\d+")
[n.match(input)].tap { |a| a << n.match(input, a.last.end(0)) until a.last.nil? }[0..-2]
=> [#<MatchData "12">, #<MatchData "34">, #<MatchData "567">]
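The loop-based answers above restart the search at match.end(0), which never advances if the pattern can match an empty string. A hedged sketch of a variant that guards against that case follows; all_matches is a hypothetical helper name, and it returns a plain array rather than an Enumerator:

```ruby
# Collect a MatchData for every non-overlapping match of regex in string.
def all_matches(string, regex)
  matches = []
  pos = 0
  while pos <= string.length && (m = regex.match(string, pos))
    matches << m
    # Step past zero-width matches so the loop always advances.
    pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
  end
  matches
end

input = "abc12def34ghijklmno567pqrs"
found = all_matches(input, /\d+/).map { |m| [m[0], m.begin(0), m.end(0)] }
p found # [["12", 3, 5], ["34", 8, 10], ["567", 19, 22]]
```

Each element is a full MatchData, so begin, end, pre_match, and named captures remain available to the caller.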
