Ruby regex question wrt the sub method on String - ruby

I'm running through the Koans tutorial (which is a great way to learn) and I've encountered this statement:
assert_equal __, "one two-three".sub(/(t\w*)/) { $1[0, 1] }
In this statement the __ is where I'm supposed to put my expected result to make the test execute correctly. I have stared at this for a while and have pulled most of it apart but I cannot figure out what the last bit means:
{ $1[0, 1] }
The expected answer is:
"one t-three"
and I was expecting:
"t-t"

{ $1[0, 1] } is a block containing the expression $1[0,1]. $1[0,1] evaluates to the first character of the string $1, which contains the contents of the first capturing group of the last matched regex.
When sub is invoked with a regex and a block, it will find the first match of the regex, invoke the block, and then replace the matched substring with the result of the block.
So "one two-three".sub(/(t\w*)/) { $1[0, 1] } searches for the pattern t\w*. This finds the substring "two". Since the whole thing is in a capturing group, this substring is stored in $1. Now the block is called and returns "two"[0,1], which is "t". So "two" is replaced by "t" and you get "one t-three".
An important thing to note is that sub, unlike gsub, only replaces the first occurrence, not every occurrence of the pattern.
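To see the difference concretely, here is a minimal sketch that runs the same pattern and block through both methods. The variable names are only for illustration.

```ruby
text = "one two-three"

# sub stops after the first match ("two").
first_only = text.sub(/(t\w*)/) { $1[0, 1] }

# gsub keeps going and also replaces "three".
every_match = text.gsub(/(t\w*)/) { $1[0, 1] }

puts first_only   # one t-three
puts every_match  # one t-t
```

The gsub result is "one t-t" rather than "t-t" because "one" contains no "t" and is left untouched.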

@sepp2k already gave a really good answer; I just wanted to add how you could have used IRB to maybe get there yourself:
>> "one two-three".sub(/(t\w*)/) { $1 } #=> "one two-three"
>> "one two-three".sub(/(t\w*)/) { $1[0] } #=> "one t-three"
>> "one two-three".sub(/(t\w*)/) { $1[1] } #=> "one w-three"
>> "one two-three".sub(/(t\w*)/) { $1[2] } #=> "one o-three"
>> "one two-three".sub(/(t\w*)/) { $1[3] } #=> "one -three"
>> "one two-three".sub(/(t\w*)/) { $1[0,3] } #=> "one two-three"
>> "one two-three".sub(/(t\w*)/) { $1[0,2] } #=> "one tw-three"
>> "one two-three".sub(/(t\w*)/) { $1[0,1] } #=> "one t-three"

Cribbing from the documentation (http://ruby-doc.org/core/classes/String.html#M001185), here are answers to your two questions "why is the return value 'one t-three'" and "what does { $1[0, 1] } mean?"
What does { $1[0, 1] } mean?
The method String#sub can take either two arguments, or one argument and a block. The latter is the form being used here, and it's just like the method Integer#times, which takes a block:
5.times { puts "hello!" }
So that explains the enclosing curly braces.
$1 is the substring matching the first capture group of the regex, as described here. [0, 1] is the string method "[]", which returns a substring given a start index and a length - here, the first character.
Put together, { $1[0, 1] } is a block which returns the first character in $1, where $1 is the substring to have been matched by a capture group when a regex was last used to match a string.
Why is the return value 'one t-three'?
The method String#sub ('substitute'), unlike its brother String#gsub ('globally substitute'), replaces the first portion of the string matching the regex with its replacement. Hence the method is going to replace the first substring matching "(t\w*)" with the value of the block described above - i.e. with its first character. Since 'two' is the first substring matching (t\w*) (a 't' followed by any number of letters), it is replaced by its first character, 't'.
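For comparison, here is a small sketch of the two-argument form mentioned above next to the block form. The regex in the first call is adjusted (an assumption for illustration, not the koan's pattern) so that the capture group wraps only the leading "t".

```ruby
text = "one two-three"

# Two-argument form: '\1' in the replacement string refers to capture group 1.
# The group wraps only the leading "t", so the rest of the word is dropped.
with_string = text.sub(/(t)\w*/, '\1')

# Block form from the koan: the block's return value is the replacement.
with_block = text.sub(/(t\w*)/) { $1[0, 1] }

puts with_string # one t-three
puts with_block  # one t-three
```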

Related

How do I apply gsub subject to a function?

I'm using Rails 5 and Ruby 2.4. I have a function
my_function(str1, str2)
that will return true or false given two string arguments. What I would like to do is given a larger string, for instance
"a b c d"
I would like to replace two consecutive "words" (a word by my definition is a sequence of characters followed by a word boundary) with the empty string if the expression
my_function(str1, str2)
evaluates to true for those two consecutive words. So for instance, if
my_function("b", "c")
evaluates to true, I would like the above string to become
"a d"
How do I do this?
Edit: I'm including the output based on Tom Lord's answer ...
If I use
def stuff(line)
matches = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
matches.each do |full_match, word1, word2|
line.delete!(full_match) if word1.eql?("hello") && word2.eql?("world")
end
end
and line is
"hello world this is a test"
the resulting string line is
"tisisatst"
This is not quite what I expected. The result should be
" this is a test"
Edit: This is an updated answer, based on the comments below. I have left my original answer at the bottom.
Scanning a string for "two consecutive words" is a bit tricky. Your best option is probably to use the \b anchor in a regex, which signifies a "word boundary":
string_to_change = "a b c d"
matches = string_to_change.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
# => [["a b", "a", "b"], ["c d", "c", "d"]]
...Where the first string is the "full match" (including any whitespace or punctuation), the others are the two words.
To break down that regex:
\b means "word boundary". I have placed one on each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\S+? means "one or more non-whitespace characters". (Matching non-greedily, so it will stop matching at the first word boundary).
You can then remove each "full match" from the string, if the method returns true for the two words:
matches.each do |full_match, word1, word2|
string_to_change.gsub!(full_match, '') if my_function(word1, word2)
end
One thing that's not accounted for here (you didn't specify this well in your question...) was how to handle strings containing three or more words. For example, consider the following:
"hello world this is a test"
Suppose my_function(word1, word2) returns true only for the pairs: "world", "this" and "hello", "is".
My code above will only look at the pairs ("hello", "world"), ("this", "is") and ("a", "test"). But perhaps it should actually:
Look at all pairs of words, i.e. match all words with the left- and right- hand side.
Delete pairs of words repeatedly, i.e. after the initial pair: "world this" is removed, the string should be re-scanned and then "hello is" should also be removed?
If such further enhancements are needed, then please explain them clearly in a new question (if you are struggling to solve the problem yourself).
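As one possible reading of "look at all pairs", here is a rough sketch that checks every word against its right-hand neighbour in a single left-to-right pass. The helper name remove_adjacent_pairs is hypothetical, the block stands in for my_function, and the sketch assumes it is acceptable to normalize whitespace to single spaces and not to re-scan the result.

```ruby
# Hypothetical helper, not part of the answer above.
# Walks the words left to right; whenever the block returns true for a word
# and its right-hand neighbour, both are dropped. Whitespace is normalized to
# single spaces, and the result is not re-scanned.
def remove_adjacent_pairs(text)
  words = text.split
  kept = []
  i = 0
  while i < words.length
    if i + 1 < words.length && yield(words[i], words[i + 1])
      i += 2 # skip both words of the matching pair
    else
      kept << words[i]
      i += 1
    end
  end
  kept.join(" ")
end

bc_removed = remove_adjacent_pairs("a b c d") { |left, right| left == "b" && right == "c" }
puts bc_removed # a d
```

Unlike the scan-based version, this also considers the pair ("b", "c"), which sits between the two pairs that scan returns.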
Original answer:
str1 = "b"
str2 = "c"
string_to_change = "a b c d"
if my_function(str1, str2)
string_to_change.gsub!(/\b#{str1}\b\s+\b#{str2}\b/, "")
end
To break down that regex:
\b means "word boundary". I have placed one on each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\s+ means "one or more whitespace characters". You may wish to tweak this to allow other punctuation too, such as a comma or full stop. A fully generic solution to this issue could in fact be:
string_to_change.gsub!(/\b#{str1}\b.(\B.)*#{str2}\b/, "")
# Or equivalently:
string_to_change.gsub!(/\b#{str1}\b(.\B)*.#{str2}\b/, "")
.(\B.)* instead collects each character, one at a time, always checking that it's not the first letter of a word (i.e. is preceded by a non-word boundary).
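One caveat with the patterns above: str1 and str2 are interpolated as regex source, so a character such as "." in them is treated as a metacharacter. A minimal sketch of the difference, assuming the words should be matched literally (the sample strings are made up for illustration):

```ruby
str1 = "a.c"
str2 = "d"
string_to_change = "abc d a.c d e"

# Unescaped: the "." in str1 matches any character, so "abc d" is removed too.
unescaped = string_to_change.gsub(/\b#{str1}\b\s+\b#{str2}\b/, "")

# Escaped: Regexp.escape makes the "." literal, so only "a.c d" is removed.
escaped = string_to_change.gsub(/\b#{Regexp.escape(str1)}\b\s+\b#{Regexp.escape(str2)}\b/, "")

p unescaped # "  e"
p escaped   # "abc d  e"
```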

Finding the first duplicate character in the string Ruby

I am trying to call the first duplicate character in my string in Ruby.
I have defined an input string using gets.
How do I call the first duplicate character in the string?
This is my code so far.
string = "#{gets}"
print string
How do I call a character from this string?
Edit 1:
This is the code I have now where my output is coming out to me No duplicates 26 times. I think my if statement is wrongly written.
string = "abcade"
puts string
for i in ('a'..'z')
if string =~ /(.)\1/
puts string.chars.group_by{|c| c}.find{|el| el[1].size >1}[0]
else
puts "no duplicates"
end
end
My second puts statement works but with the for and if loops, it returns no duplicates 26 times whatever the string is.
The following returns the index of the first duplicate character:
the_string =~ /(.)\1/
Example:
'1234556' =~ /(.)\1/
=> 4
To get the duplicate character itself, use $1:
$1
=> "5"
Example usage in an if statement:
if my_string =~ /(.)\1/
# found duplicate; potentially do something with $1
else
# there is no match
end
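Note that /(.)\1/ only finds a character that is immediately followed by itself. A small sketch, using a hypothetical helper name, that wraps the same pattern and shows why a string like "abcade" produces no match:

```ruby
# Hypothetical helper: returns the first character that is immediately
# repeated, or nil if no two adjacent characters are the same.
def first_adjacent_duplicate(text)
  # String#[] with a regex and a group index returns that capture, or nil.
  text[/(.)\1/, 1]
end

p first_adjacent_duplicate("1234556") # "5"
p first_adjacent_duplicate("abcade")  # nil, the two "a"s are not adjacent
```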
s.chars.map { |c| [c, s.count(c)] }.drop_while{|i| i[1] <= 1}.first[0]
With the refined form from Cary Swoveland :
s.each_char.find { |c| s.count(c) > 1 }
The method below might be useful for finding the word that occurs most often in a string:
def firstRepeatedWord(string)
h_data = Hash.new(0)
string.split(" ").each{|x| h_data[x] +=1}
h_data.key(h_data.values.max)
end
I believe the question can be interpreted in either of two ways (neither involving the first pair of adjacent characters that are the same) and offer solutions to each.
Find the first character in the string that is preceded by the same character
I don't believe we can use a regex for this (but would love to be proved wrong). I would use the method suggested in a comment by @DaveNewton:
require 'set'
def first_repeat_char(str)
str.each_char.with_object(Set.new) { |c,s| return c unless s.add?(c) }
nil
end
first_repeat_char("abcdebf") #=> b
first_repeat_char("abcdcbe") #=> c
first_repeat_char("abcdefg") #=> nil
Find the first character in the string that appears more than once
r = /
(.) # match any character in capture group #1
.* # match any character zero or more times
? # do the preceding lazily
\K # forget everything matched so far
\1 # match the contents of capture group 1
/x
"abcdebf"[r] #=> b
"abccdeb"[r] #=> b
"abcdefg"[r] #=> nil
This regex is fine, but produces the warning, "regular expression has redundant nested repeat operator '*'". You can disregard the warning or suppress it by doing something clunky, like:
r = /([^#{0.chr}]).*?\K\1/
where ([^#{0.chr}]) means "match any character other than 0.chr in capture group 1".
Note that a positive lookbehind cannot be used here, as they cannot contain variable-length matches (i.e., .*).
You could probably make your string an array and use detect. This should return the first char where the count is > 1.
string.split("").detect {|x| string.count(x) > 1}
I'll use positive lookahead with String#[] method :
"abcccddde"[/(.)(?=\1)/] #=> c
As a variant:
str = "abcdeff"
p str.chars.group_by{|c| c}.find{|el| el[1].size > 1}[0]
prints "f"

Rails: Remove substring from the string if in array

I know I can easily remove a substring from a string.
Now I need to remove every substring from a string, if the substring is in an array.
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"
# some awesome function
string = string.replace_if_in?(arr, '')
# desired output => "Only delete the and the"
All of the functions that remove from or adjust a string, such as sub, gsub, tr, ... only take one word as an argument, not an array. But my array has over 20 elements, so I need a better way than using sub 20 times.
Sadly it's not only about removing words, but about removing whole substrings such as "1. foo".
How would I attempt this?
You can use gsub which accepts a regex, and combine it with Regexp.union:
string.gsub(Regexp.union(arr), '')
# => "Only delete the and the "
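If the leftover whitespace matters, one option is to tidy it afterwards. This is only a sketch, and it assumes that collapsing runs of spaces and trimming the ends is acceptable for your data:

```ruby
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"

# Regexp.union escapes each string, so the "." in "1. foo" is matched literally.
pattern = Regexp.union(arr)

# Remove the substrings, then collapse repeated spaces and trim the ends.
cleaned = string.gsub(pattern, "").squeeze(" ").strip

p cleaned # "Only delete the and the"
```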
As follows:
arr = ["1. foo", "2. bar"]
string = "Only delete the 1. foo and the 2. bar"
arr.each {|x| string.slice!(x) }
string # => "Only delete the and the "
One further point: this also lets you remove text containing regexp metacharacters such as \ or . (Uri's answer allows this too):
string = "Only delete the 1. foo and the 2. bar and \\...."
arr = ["1. foo", "2. bar", "\\..."]
arr.each {|x| string.slice!(x) }
string # => "Only delete the and the and ."
Use #gsub with #join on the array elements
You can use #gsub by calling #join on the elements of the array, joining them with the regex alternation operator. For example:
arr = ["foo", "bar"]
string = "Only delete the foo and the bar"
string.gsub /#{arr.join ?|}/, ''
#=> "Only delete the and the "
You can then deal with the extra spaces left behind in any way you see fit. This is a better method when you want to censor words. For example:
string.gsub /#{arr.join ?|}/, '<bleep>'
#=> "Only delete the <bleep> and the <bleep>"
On the other hand, split/reject/join might be a better method chain if you need to care about whitespace. There's always more than one way to do something, and your mileage may vary.
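A rough sketch of the split/reject/join chain mentioned above. It assumes the items to remove are single whitespace-delimited words, so it would not handle a multi-word substring such as "1. foo":

```ruby
arr = ["foo", "bar"]
string = "Only delete the foo and the bar"

# Split on whitespace, drop any word listed in arr, and rejoin with single spaces.
kept_words = string.split.reject { |word| arr.include?(word) }
cleaned = kept_words.join(" ")

p cleaned # "Only delete the and the"
```

Because the words are rejoined with single spaces, no extra whitespace is left behind.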

Using array results from ruby gsub method and a match block

I'm working through the Ruby koans and have hit one that's really confusing me.
"one two-three".gsub(/(t\w*)/) { $1[0, 1] }
=> "one t-t"
However, when I modify the return array for the $1 variable, I get a confusing result.
"one two-three".gsub(/(t\w*)/) { $1[1, 2] }
=> "one wo-hr"
Given the first result, I'd expect the second bit of code to return "one w-h". Why are two characters being returned in the second instance?
You expect "one w-h" which would be the result of this:
"one two-three".gsub(/(t\w*)/) { $1[1, 1] }
[] is a method on String that can take a start index and a length, like so:
str[start, length]
so the 2 in your code is actually the length (i.e. number of characters)
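A short sketch of the difference between the start/length form and a range, using "three" (the second match in the example) as the sample string:

```ruby
word = "three"

one_char  = word[1, 1]  # start at index 1, length 1
two_chars = word[1, 2]  # start at index 1, length 2
by_range  = word[1..2]  # a range gives start and end index instead

p one_char  # "h"
p two_chars # "hr"
p by_range  # "hr"
```

With a range the second number is an end index, which is why word[1..2] and word[1, 2] happen to agree here.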

What is between { }?

There is a piece of code:
def test_sub_is_like_find_and_replace
assert_equal "one t-three", "one two-three".sub(/(t\w*)/) { $1[0, 1] }
end
I found it really hard to understand what is between { } braces. Could anyone explain it please?
The {...} is a block. Ruby will pass the matched value to the block, and substitute the return value of the block back into the string. The String#sub documentation explains this more fully:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
Edit: Per Michael's comment, if you're confused about $1[0, 1], this is just taking the first capture ($1) and taking a substring of it (the first character, specifically). $1 is a global variable set to the contents of the first capture after a regex (in true Perl fashion), and since it's a string, the #[] operator is used to take a substring of it starting at index 0, with a length of 1.
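As the quoted documentation says, the match is also passed to the block as a parameter. Here is a small sketch of two alternatives to $1 that give the same result in this case, since the capture group spans the whole match:

```ruby
text = "one two-three"

# The block parameter receives the whole matched string ("two").
via_param = text.sub(/(t\w*)/) { |match| match[0, 1] }

# Regexp.last_match(1) is the method equivalent of $1.
via_last_match = text.sub(/(t\w*)/) { Regexp.last_match(1)[0, 1] }

p via_param      # "one t-three"
p via_last_match # "one t-three"
```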
The sub method takes either two arguments, the first being the text to replace and the second being the replacement, or one argument, the text to replace, and a block defining how to handle the replacement.
The block method is useful if you can't define your replacement as a simple string.
For example:
"foo".sub(/(\w)/) { $1.upcase }
# => "Foo"
"foo".sub(/(\w+)/) { $1.upcase }
# => "FOO"
The gsub method works the same way, but applies more than once:
"foo".gsub(/(\w)/) { $1.upcase }
# => "FOO"
In all cases, $1 refers to the contents captured by the brackets (\w).
Your code, illustrated
r = "one two-three".sub(/(t\w*)/) do
$1 # => "two"
$1[0, 1] # => "t"
end
r # => "one t-three"
sub takes a regular expression as its argument. $1 is a special global variable that contains the text matched by the first capture group of the most recent match.
The braces delimit a block of code; the match is replaced with the string returned by the block. In this case
puts $1
#=> "two"
puts $1[0, 1]
#=> "t"

Resources