Using array results from ruby gsub method and a match block

I'm working through the Ruby koans and have hit one that's really confusing me.
"one two-three".gsub(/(t\w*)/) { $1[0, 1] }
=> "one t-t"
However, when I modify the return array for the $1 variable, I get a confusing result.
"one two-three".gsub(/(t\w*)/) { $1[1, 2] }
=> "one wo-hr"
Given the first result, I'd expect the second bit of code to return "one w-h". Why are two characters being returned in the second instance?

You expect "one w-h" which would be the result of this:
"one two-three".gsub(/(t\w*)/) { $1[1, 1] }
[] is a method on String that accepts a start index and a length, like so:
str[start, length]
so the 2 in your code is the length (i.e. the number of characters to return), not an end index.
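To make the difference concrete, here is a small sketch (the variable name word is only for illustration) comparing the str[start, length] form with the range form:

```ruby
# String#[] with two integers is str[start, length], not str[first, last].
word = "three"

p word[0, 1]   # "t"   start at index 0, take 1 character
p word[1, 1]   # "h"   start at index 1, take 1 character
p word[1, 2]   # "hr"  start at index 1, take 2 characters
p word[1..2]   # "hr"  range form: indices 1 through 2 inclusive

# The koan expression with a length of 1 gives the result the asker expected.
p "one two-three".gsub(/(t\w*)/) { $1[1, 1] }   # "one w-h"
```

If you want "from index 1 to index 2", the range form word[1..2] says that directly.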

Related

Ruby one liner lazy string evaluation

I'd like to create a Ruby one-liner that prints some information to stdout and gets data from stdin. I've got some code:
["This should be shown first", "This second: #{gets.chomp}"].each{|i| puts "#{i}"}
...but apparently gets.chomp is evaluated when the whole array is built, before each iterates over the elements.
As a result, I'm prompted for input first, and only then is each element printed.
Can I somehow evaluate it lazily, print the array in order, and still keep the whole thing on one line?
One way to achieve lazy evaluation is to use procs. Something like this (multiple lines for readability):
[
-> { puts "This should be shown first" },
-> { print "This second: "; puts gets.chomp },
].each(&:call)
I don't really see the advantage of making this a one-liner since it becomes pretty unreadable, but nevertheless:
[ ->{ "This should be shown first" },
->{ "This second: #{gets.chomp}" }
].each {|line| puts line.call }
P.S. Never do "#{foo}". Use string interpolation (#{...}) when you want to, well, interpolate strings, as on the second line above. If you want to turn a non-string into a string, do foo.to_s. If you know it's already a string (or don't care if it is) just use it directly: foo. But puts automatically calls to_s on its arguments, so just do puts foo.
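As a rough sketch of why the lambda approach above works: nothing inside a lambda body runs until call is invoked, so the first line is written before gets is ever reached. The helper name run_prompts and the use of StringIO as a stand-in for the terminal are assumptions made so the example runs without a real stdin:

```ruby
require "stringio"

# Hypothetical helper: runs each lambda in order, reading from `input` and
# writing to `output`. A lambda body is not evaluated until `call`.
def run_prompts(input, output)
  steps = [
    -> { "This should be shown first" },
    -> { "This second: #{input.gets.chomp}" }
  ]
  steps.each { |step| output.puts step.call }
end

# StringIO stands in for the terminal; pass $stdin and $stdout for real use.
out = StringIO.new
run_prompts(StringIO.new("hello\n"), out)
print out.string
# This should be shown first
# This second: hello
```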
If you don't mind the repetition of puts:
['puts "This should be shown first"', 'puts "This second: #{gets.chomp}"'].each{|i| eval i}
This is just to show you could use a method rather than a proc.
def line2
"#{["cat","dog"].sample}"
end
["Line 1", :line2, "line 3"].each { |l| puts(l.is_a?(Symbol) ? method(l).call : l) }
# Prints, for example:
# Line 1
# dog
# line 3

What is between { }?

There is a piece of code:
def test_sub_is_like_find_and_replace
assert_equal "one t-three", "one two-three".sub(/(t\w*)/) { $1[0, 1] }
end
I found it really hard to understand what is between { } braces. Could anyone explain it please?
The {...} is a block. Ruby will pass the matched value to the block, and substitute the return value of the block back into the string. The String#sub documentation explains this more fully:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
Edit: Per Michael's comment, if you're confused about $1[0, 1], this is just taking the first capture ($1) and taking a substring of it (the first character, specifically). $1 is a global variable set to the contents of the first capture after a regex (in true Perl fashion), and since it's a string, the #[] operator is used to take a substring of it starting at index 0, with a length of 1.
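To see what the quoted documentation describes, the following sketch records the block parameter and the match variables while sub runs. The method name inspect_sub is hypothetical and exists only to package the example:

```ruby
# Hypothetical helper: returns the substituted string together with the
# values that were visible inside the block.
def inspect_sub(text, pattern)
  seen = nil
  result = text.sub(pattern) do |match|
    seen = {
      block_arg: match,  # the whole match, passed as the block parameter
      whole:     $&,     # the whole match again, via the special variable
      group1:    $1,     # contents of the first capture group
      before:    $`,     # text before the match
      after:     $'      # text after the match
    }
    $1[0, 1]             # the block's value replaces the match
  end
  [result, seen]
end

result, seen = inspect_sub("one two-three", /(t\w*)/)
p result  # "one t-three"
p seen
```

Here the block parameter, $& and $1 all hold "two", because the capture group spans the whole pattern.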
The sub method takes either two arguments, the first being the text to replace and the second being the replacement, or one argument being the text to replace and a block defining how to handle the replacement.
The block method is useful if you can't define your replacement as a simple string.
For example:
"foo".sub(/(\w)/) { $1.upcase }
# => "Foo"
"foo".sub(/(\w+)/) { $1.upcase }
# => "FOO"
The gsub method works the same way, but applies more than once:
"foo".gsub(/(\w)/) { $1.upcase }
# => "FOO"
In all cases, $1 refers to the contents captured by the parentheses, (\w) or (\w+).
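For comparison, here is a short sketch of the two-argument form next to the block form. The replacement values are arbitrary and chosen only for illustration:

```ruby
text = "one two-three"

# Two-argument form: a pattern plus a replacement string.
p text.sub(/t\w*/, "X")        # "one X-three"

# Backreferences such as \1 work in the replacement string (single quotes
# keep the backslash literal).
p text.sub(/(t)\w*/, '\1')     # "one t-three"

# Block form: the replacement is computed from the match, which a plain
# replacement string cannot do.
p text.sub(/t\w*/) { |m| m.length.to_s }   # "one 3-three"
```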
Your code, illustrated
r = "one two-three".sub(/(t\w*)/) do
$1 # => "two"
$1[0, 1] # => "t"
end
r # => "one t-three"
sub takes a regular expression as its argument. $1 is a special global variable that holds the text matched by the first capture group of the most recent match.
The braces delimit a block of code; sub replaces the match with the string returned by the block. In this case
puts $1
#=> "two"
puts $1[0, 1]
#=> "t"

Split string into a list, but keeping the split pattern

Currently I am splitting a string by a pattern, like this:
outcome_array=the_text.split(pattern_to_split_by)
The problem is that the pattern I split by always gets omitted.
How do I get it to include the split pattern itself?
Thanks to Mark Wilkins for inspiration, but here's a shorter bit of code for doing it:
irb(main):015:0> s = "split on the word on okay?"
=> "split on the word on okay?"
irb(main):016:0> b=[]; s.split(/(on)/).each_slice(2) { |s| b << s.join }; b
=> ["split on", " the word on", " okay?"]
or:
s.split(/(on)/).each_slice(2).map(&:join)
See below the fold for an explanation.
Here's how this works. First, we split on "on", but wrap it in parentheses to make it into a match group. When there's a match group in the regular expression passed to split, Ruby will include that group in the output:
s.split(/(on)/)
# => ["split ", "on", " the word ", "on", " okay?"]
Now we want to join each instance of "on" with the preceding string. each_slice(2) helps by passing two elements at a time to its block. Let's invoke each_slice(2) on its own to see the result. Since each_slice returns an Enumerator when invoked without a block, we'll call to_a on it to see what it will enumerate over:
s.split(/(on)/).each_slice(2).to_a
# => [["split ", "on"], [" the word ", "on"], [" okay?"]]
We're getting close. Now all we have to do is join the words together. And that gets us to the full solution above. I'll unwrap it into individual lines to make it easier to follow:
b = []
s.split(/(on)/).each_slice(2) do |s|
b << s.join
end
b
# => ["split on", " the word on", " okay?"]
But there's a nifty way to eliminate the temporary b and shorten the code considerably:
s.split(/(on)/).each_slice(2).map do |a|
a.join
end
map passes each element of its input array to the block; the result of the block becomes the new element at that position in the output array. In MRI >= 1.8.7, you can shorten it even more, to the equivalent:
s.split(/(on)/).each_slice(2).map(&:join)
You could use a regular expression assertion to locate the split point without consuming any of the input. Below uses a positive look-behind assertion to split just after 'on':
s = "split on the word on okay?"
s.split(/(?<=on)/)
=> ["split on", " the word on", " okay?"]
Or a positive look-ahead to split just before 'on':
s = "split on the word on okay?"
s.split(/(?=on)/)
=> ["split ", "on the word ", "on okay?"]
With something like this, you might want to make sure 'on' was not part of a larger word (like 'assertion'), and also remove whitespace at the split:
"don't split on assertion".split(/(?<=\bon\b)\s*/)
=> ["don't split on", "assertion"]
If you use a pattern with groups, it will return the pattern in the results as well:
irb(main):007:0> "split it here and here okay".split(/ (here) /)
=> ["split it", "here", "and", "here", "okay"]
Edit: The additional information indicated that the goal is to include the item on which the string was split with one of the halves of the split items. I would think there is a simple way to do that, but I don't know it and haven't had time today to play with it. So in the absence of a clever solution, the following is one way to brute-force it. Use the split method as described above to include the split items in the array. Then iterate through the array and combine every second entry (which by definition is the split value) with the previous entry.
s = "split on the word on and include on with previous"
a = s.split(/(on)/)
# iterate through and combine adjacent items together and store
# results in a second array
b = []
a.each_index{ |i|
b << a[i] if i.even?
b[b.length - 1] += a[i] if i.odd?
}
print b
Results in this:
["split on", " the word on", " and include on", " with previous"]
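The approaches above could be wrapped in a small reusable method. This is only a sketch: the name split_keeping_delimiter is hypothetical, and it assumes the delimiter is a literal string rather than an arbitrary pattern:

```ruby
# Hypothetical helper: split on a literal delimiter, keeping the delimiter
# attached to the end of the piece that precedes it.
def split_keeping_delimiter(text, delimiter)
  # Wrapping the escaped delimiter in a capture group makes split include it.
  pattern = /(#{Regexp.escape(delimiter)})/
  # Pieces alternate text, delimiter, text, delimiter, ...; join each pair.
  text.split(pattern).each_slice(2).map(&:join)
end

p split_keeping_delimiter("split on the word on okay?", "on")
# ["split on", " the word on", " okay?"]
p split_keeping_delimiter("a.b.c", ".")
# ["a.", "b.", "c"]
```

Like the original snippets, this also splits inside longer words that contain the delimiter; the look-behind variant with \b shown earlier avoids that.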

What Does This Ruby/RegEx Code Do?

I'm going through Beginning Ruby From Novice To Professional 2nd Edition and am currently on page 49 where we are learning about RegEx basics. Each RegEx snippet in the book has a piece of code trailing it that hasn't been explained.
{ |x| puts x }
In context:
"This is a test".scan(/[a-m]/) { |x| puts x }
Could someone please clue me in?
A method such as scan is an iterator; in this case, each time the passed regex is matched, scan does something programmer-specified. In Ruby, the "something" is expressed as a block, represented by { code } or do code end (with different precedences), which is passed as a special parameter to the method. A block may start with a list of parameters (and local variables), which is the |x| part; scan invokes the block with the string it matched, which is bound to x inside the block. (This syntax comes from Smalltalk.)
So, in this case, scan will invoke its block parameter every time /[a-m]/ matches, which means on every character in the string between a and m.
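As a brief illustration of scan with and without a block (the variable names text and collected are just for this sketch):

```ruby
text = "This is a test"

# Without a block, scan returns every match as an array.
p text.scan(/[a-m]/)            # ["h", "i", "i", "a", "e"]

# With a block, scan yields each match in turn, bound to x.
collected = []
text.scan(/[a-m]/) { |x| collected << x.upcase }
p collected                     # ["H", "I", "I", "A", "E"]

# If the pattern has capture groups, each result is an array of the groups.
p "a1 b2".scan(/([a-z])(\d)/)   # [["a", "1"], ["b", "2"]]
```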
It prints all letters in the string between a and m: http://ideone.com/lKaoI
|x| puts x is an anonymous function (or a "block" in Ruby, as far as I can tell, or a lambda in other languages) that prints its argument.
More information on that can be found in:
Wikipedia - Ruby - Blocks and iterators
Understanding Ruby Blocks, Procs and Lambdas
The output is
h
i
i
a
e
Each character of the string "This is a test" is checked against the regular expression [a-m], which means "exactly one character in the range a..m", and is printed on its own line (via puts) if it matches. The first character T does not match, the second one h does match, etc. The last one that does is the e in "test".
In the context of your book's examples, it's included after each expression because it just means "Print out every match."
It is a code block, which runs for each match of the regular expression.
{ } creates the code block.
|x| creates the argument for the code block
puts prints out a string, and x is the string it prints.
The regular expression matches any single character in the character class [a-m]. Therefore, there are five matches, and it prints out:
h
i
i
a
e
The { |x| puts x } defines a new block that takes a single argument named x. When the block is called, it passes its argument x to puts.
Another way to write the same thing would be:
"This is a test".scan(/[a-m]/) do |x|
puts x
end
The block gets called by the scan function each time the regular expression matches something in the string, so each match will get printed.
There is more information about blocks here:
http://www.ruby-doc.org/docs/ProgrammingRuby/html/tut_containers.html

Ruby regex question wrt the sub method on String

I'm running through the Koans tutorial (which is a great way to learn) and I've encountered this statement:
assert_equal __, "one two-three".sub(/(t\w*)/) { $1[0, 1] }
In this statement the __ is where I'm supposed to put my expected result to make the test execute correctly. I have stared at this for a while and have pulled most of it apart but I cannot figure out what the last bit means:
{ $1[0, 1] }
The expected answer is:
"one t-three"
and I was expecting:
"t-t"
{ $1[0, 1] } is a block containing the expression $1[0,1]. $1[0,1] evaluates to the first character of the string $1, which contains the contents of the first capturing group of the last matched regex.
When sub is invoked with a regex and a block, it will find the first match of the regex, invoke the block, and then replace the matched substring with the result of the block.
So "one two-three".sub(/(t\w*)/) { $1[0, 1] } searches for the pattern t\w*. This finds the substring "two". Since the whole thing is in a capturing group, this substring is stored in $1. Now the block is called and returns "two"[0,1], which is "t". So "two" is replaced by "t" and you get "one t-three".
An important thing to note is that sub, unlike gsub, only replaces the first occurrence, not every occurrence of the pattern.
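A quick side-by-side sketch of that difference, using the same pattern and block as the koan:

```ruby
text = "one two-three"

# sub replaces only the first match ("two").
p text.sub(/(t\w*)/)  { $1[0, 1] }   # "one t-three"

# gsub replaces every match ("two" and "three").
p text.gsub(/(t\w*)/) { $1[0, 1] }   # "one t-t"

# Both return a new string; the original is left unchanged.
p text                               # "one two-three"
```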
@sepp2k already gave a really good answer; I just wanted to add how you could have used IRB to maybe get there yourself:
>> "one two-three".sub(/(t\w*)/) { $1 } #=> "one two-three"
>> "one two-three".sub(/(t\w*)/) { $1[0] } #=> "one t-three"
>> "one two-three".sub(/(t\w*)/) { $1[1] } #=> "one w-three"
>> "one two-three".sub(/(t\w*)/) { $1[2] } #=> "one o-three"
>> "one two-three".sub(/(t\w*)/) { $1[3] } #=> "one -three"
>> "one two-three".sub(/(t\w*)/) { $1[0,3] } #=> "one two-three"
>> "one two-three".sub(/(t\w*)/) { $1[0,2] } #=> "one tw-three"
>> "one two-three".sub(/(t\w*)/) { $1[0,1] } #=> "one t-three"
Cribbing from the documentation (http://ruby-doc.org/core/classes/String.html#M001185), here are answers to your two questions "why is the return value 'one t-three'" and "what does { $1[0, 1] } mean?"
What does { $1[0, 1] } mean?
The method String#sub can take either two arguments, or one argument and a block. The latter is the form being used here and it's just like the method Integer#times, which takes a block:
5.times { puts "hello!" }
So that explains the enclosing curly braces.
$1 is the substring matching the first capture group of the regex, as described here. [0, 1] is the string method "[]" called with a start index and a length; it returns the substring beginning at index 0 with length 1, i.e. the first character.
Put together, { $1[0, 1] } is a block which returns the first character in $1, where $1 is the substring to have been matched by a capture group when a regex was last used to match a string.
Why is the return value 'one t-three'?
The method String#sub ('substitute'), unlike its sibling String#gsub ('globally substitute'), replaces only the first portion of the string matching the regex. Hence the method replaces the first substring matching (t\w*) with the value of the block described above, i.e. with its first character. Since 'two' is the first substring matching (t\w*) (a 't' followed by any number of word characters), it is replaced by its first character, 't'.
