Why is splitting strings inconsistent? - ruby

Examine:
"test.one.two".split(".") # => ["test", "one", "two"]
Right, perfect. Exactly what we want.
"test..two".split(".") # => ["test", "", "two"]
I replaced one with the empty string, so that makes sense.
"test".split(".") # => ["test"]
That's what I would expect, no problems here.
".test".split(".") # => ["", "test"]
Yep, my string has one . so I got two sections as a result.
"test.".split(".") # => ["test"]
What? There's a . in my string, it should have been split into two sections. I didn't ask to get rid of empty strings; it didn't get rid of empty strings back in tests 2 or 4.
I would have expected ["test", ""]
"".split(".") # => []
WHAT? This should operate almost exactly like test 3, and return [""]. But now I can't perform any string methods on result[0]
Why is this inconsistent for splits that occur on the edges, or for the empty string?

The documentation explains this well: http://ruby-doc.org/core-2.2.0/String.html#method-i-split
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
So, this does what you'd expect:
"test.".split(".", -1)
=> ["test", ""]
The rest is there in the docs.
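Note that a negative limit does not change the empty-string case from the question: "".split(".", -1) is still []. If you always want at least one field back, a small wrapper can patch that case. This is only a sketch; split_fields is a hypothetical helper name, not part of Ruby:

```ruby
# Hypothetical helper: split while keeping trailing empty fields, and
# return [""] for an empty string instead of [].
def split_fields(str, sep)
  fields = str.split(sep, -1)    # negative limit keeps trailing empty fields
  fields.empty? ? [""] : fields  # "".split(sep, -1) is [], so patch that case
end

p split_fields("test.one.two", ".") # => ["test", "one", "two"]
p split_fields("test.", ".")        # => ["test", ""]
p split_fields(".test", ".")        # => ["", "test"]
p split_fields("", ".")             # => [""]
```

With this, result[0] is always a string, which is what the question wanted for the empty input.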

Related

Inconsistent Ruby .split behavior [duplicate]

Suppose I have this:
a = "|hello"
if I do:
a.split("|") #=> ["", "hello"]
Now say I have:
b = "hello|"
if I do:
b.split("|") #=> ["hello"]
Why is this happening? I expected the result to be ["hello", ""] , similar to the first example. This is the split method working inconsistently. Or is there something about its inner working that I'm not aware of?
This behaviour is described in documentation:
If the limit parameter is omitted, trailing null fields are
suppressed.
If you want to keep trailing empty strings, pass a limit, as the documentation describes. A negative limit keeps all of them; a positive limit also keeps them, but caps the number of fields returned:
"hello|".split('|', 2)
#=> ["hello", ""]
"hello|||".split('|', -1)
#=> ["hello", "", "", ""]
Note
If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
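To see how the limit changes the result, here is a small sketch that runs the same split with several limits. split_by_limits is a hypothetical helper used only for illustration:

```ruby
# Hypothetical helper: map each limit to the fields split returns for it.
def split_by_limits(str, sep, limits)
  limits.map { |limit| [limit, str.split(sep, limit)] }.to_h
end

results = split_by_limits("hello|||", "|", [0, 2, 3, 4, -1])
p results[0]  # => ["hello"]  (0 behaves like omitting the limit)
p results[2]  # => ["hello", "||"]
p results[3]  # => ["hello", "", "|"]
p results[4]  # => ["hello", "", "", ""]
p results[-1] # => ["hello", "", "", ""]
```

With a positive limit the last field holds whatever is left of the string, so a small limit does not give you every trailing empty field; -1 is the safer choice when you don't know how many fields to expect.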

How do you capture part of a regex to a variable in Ruby?

I know about "string"[/regex/], which returns the part of the string that matches. But what if I want to return only the captured part(s) of a string?
I have the string "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3". I want to store in the variable title the text The_Case_of_the_Gold_Ring.
I can capture this part with the regex /\d_(?!.*\d_)(.*).mp3$/i. But writing the Ruby "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"[/\d_(?!.*\d_)(.*).mp3$/i] returns 0_The_Case_of_the_Gold_Ring.mp3 which isn't what I want.
I can get what I want by writing
"1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3" =~ /\d_(?!.*\d_)(.*).mp3$/i
title = $~.captures[0]
But this seems sloppy. Surely there's a proper way to do this?
(I'm aware that someone can probably write a simpler regex to target the text I want that lets the "string"[/regex/] method work, but this is just an example to illustrate the problem, the specific regex isn't the issue.)
You can pass the index of the capture group as a second argument to the [/regexp/, index] method:
string = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
string[/\d_(?!.*\d_)(.*).mp3$/i, 1]
#=> "The_Case_of_the_Gold_Ring"
string[/\d_(?!.*\d_)(.*).mp3$/i, 0]
#=> "0_The_Case_of_the_Gold_Ring.mp3"
Have a look at the match method:
string = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
regexp = /\d_(?!.*\d_)(.*).mp3$/i
matches = regexp.match(string)
matches[1]
#=> "The_Case_of_the_Gold_Ring"
Here matches[0] returns the whole match, and matches[1] (and following) return the captured groups:
matches.to_a
#=> ["0_The_Case_of_the_Gold_Ring.mp3", "The_Case_of_the_Gold_Ring"]
Read more examples: http://ruby-doc.org/core-2.1.4/MatchData.html#method-i-5B-5D
You can use named captures
"1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3" =~ /\d_(?!.*\d_)(?<title>.*).mp3$/i
and $~[:title] will give you what you want
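Named captures also work with the string[regexp, capture] form, which avoids going through $~. A minimal sketch, assuming the question's pattern with the dot before mp3 escaped; title_of, FILENAME and TITLE_PATTERN are hypothetical names for illustration:

```ruby
FILENAME = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"

# Same pattern as in the question, with the dot escaped and the group named.
TITLE_PATTERN = /\d_(?!.*\d_)(?<title>.*)\.mp3$/i

# Returns the text captured by the "title" group, or nil when nothing matches.
def title_of(name)
  name[TITLE_PATTERN, "title"]
end

p title_of(FILENAME)    # => "The_Case_of_the_Gold_Ring"
p title_of("notes.txt") # => nil
```

Since the result is nil when the pattern does not match, there is no MatchData to check first.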
Meditate on this:
Here's the source string to be parsed:
str = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
Patterns can be defined as strings:
DATE_REGEX = '\d{4}-[A-Z]{3}-\d{2}'
SERIAL_REGEX = '\d{2}'
TITLE_REGEX = '.+'
Then interpolated into a regexp:
regex = /^(#{ DATE_REGEX })_(#{ SERIAL_REGEX })_(#{ TITLE_REGEX })/
# => /^(\d{4}-[A-Z]{3}-\d{2})_(\d{2})_(.+)/
The advantage to that is it's easier to maintain because the pattern is really several smaller ones.
str.match(regex) # => #<MatchData "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3" 1:"1952-FEB-21" 2:"70" 3:"The_Case_of_the_Gold_Ring.mp3">
regex.match(str) # => #<MatchData "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3" 1:"1952-FEB-21" 2:"70" 3:"The_Case_of_the_Gold_Ring.mp3">
are equivalent because both Regexp and String implement match.
We can retrieve what was captured as an array:
regex.match(str).captures # => ["1952-FEB-21", "70", "The_Case_of_the_Gold_Ring.mp3"]
regex.match(str).captures.last # => "The_Case_of_the_Gold_Ring.mp3"
We can also name the captures and access them like we would a hash:
regex = /^(?<date>#{ DATE_REGEX })_(?<serial>#{ SERIAL_REGEX })_(?<title>#{ TITLE_REGEX })/
matches = regex.match(str)
matches[:date] # => "1952-FEB-21"
matches[:serial] # => "70"
matches[:title] # => "The_Case_of_the_Gold_Ring.mp3"
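If you want all named groups at once, MatchData#named_captures (Ruby 2.4 and later) returns them as a hash keyed by group name. The sketch below is an assumption-laden variant of the pattern above: it matches the .mp3 extension outside the title group so it is not captured, and EPISODE_PATTERN and parse_filename are hypothetical names:

```ruby
# Hypothetical pattern: the same pieces as above, anchored at both ends,
# with the extension matched outside the title group.
EPISODE_PATTERN = /\A(?<date>\d{4}-[A-Z]{3}-\d{2})_(?<serial>\d{2})_(?<title>.+)\.mp3\z/

# Returns a hash of group name => captured text, or nil when nothing matches.
def parse_filename(name)
  match = EPISODE_PATTERN.match(name)
  match && match.named_captures
end

info = parse_filename("1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3")
p info["date"]   # => "1952-FEB-21"
p info["serial"] # => "70"
p info["title"]  # => "The_Case_of_the_Gold_Ring"
```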
Of course, it's not necessary to mess with that rigamarole at all. We can split the string on underscores ('_'):
str = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
str.split('_') # => ["1952-FEB-21", "70", "The", "Case", "of", "the", "Gold", "Ring.mp3"]
split can take a limit parameter that caps the number of fields returned, so the string is split at most limit - 1 times. Passing in 3 gives us:
str.split('_', 3) # => ["1952-FEB-21", "70", "The_Case_of_the_Gold_Ring.mp3"]
Grabbing the last element returns:
str.split('_', 3).last # => "The_Case_of_the_Gold_Ring.mp3"
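The last field still carries the .mp3 extension. One way to drop it, sketched here on the assumption that the extension is always .mp3, is to strip it with File.basename before splitting. filename_parts is a hypothetical helper name:

```ruby
# Hypothetical helper: strip the ".mp3" suffix, then split into at most
# three fields so the underscores inside the title are left alone.
def filename_parts(name)
  File.basename(name, ".mp3").split("_", 3)
end

date, serial, title = filename_parts("1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3")
p date   # => "1952-FEB-21"
p serial # => "70"
p title  # => "The_Case_of_the_Gold_Ring"
```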
I believe it would be easiest to use a capture group here, but I'd like to present some possibilities that do not, for illustrative purposes. All employ the same positive lookahead ((?=\.mp3$)). All but one use a positive lookbehind, and one uses \K to "forget" the match up to the last character before the beginning of the desired match. Some permit the matched string to contain digits (.+); others do not ([^\d]+).
str = "1952-FEB-21_70_The_Case_of_the_Gold_Ring.mp3"
1 # match follows last digit followed by underscore, cannot contain digits
str[/(?<=\d_)[^\d]+(?=\.mp3$)/]
#=> "The_Case_of_the_Gold_Ring"
2 # same as 1, as `\K` disregards match to that point
str[/\d_\K[^\d]+(?=\.mp3$)/]
#=> "The_Case_of_the_Gold_Ring"
3 # match follows underscore, two digits, underscore, may contain digits
str[/(?<=_\d\d_).+(?=\.mp3$)/]
#=> "The_Case_of_the_Gold_Ring"
4 # match follows string having specific pattern, may contain digits
str[/(?<=\d{4}-[A-Z]{3}-\d{2}_\d{2}_).+(?=\.mp3$)/]
#=> "The_Case_of_the_Gold_Ring"
5 # match follows digit, any 12 characters, another digit and underscore,
# may contain digits
str[/(?<=\d.{12}\d_).+(?=\.mp3$)/]
#=> "The_Case_of_the_Gold_Ring"

How can I match Word Boundary "or" [@#]?

I can't seem to get a regex that matches either a hashtag #, an @, or a word boundary. The goal is to break a string into Twitter-like entities and topics so:
input = "Hello #world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "#world", "#ruby", "anotherString"]
To get just the words, excluding "anotherString" which is too large, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and @s. It seems like it should work simply with:
/[\b@#]\w{3,12}\b/
but that returns ["#world", "#ruby"]. This made me realize that word boundaries are not by definition a character, so they don't fall into the category of "A single character" and, so, won't match. A few more attempts:
/\b|[@#]\w{3,12}\b/
returns ["", "", "#world", "", "#ruby", "", "", ""].
/(\b|[@#])\w{3,12}\b/
matches the right things, but returns [[""], ["#"], ["#"]] as expected, because the parentheses also mean capture everything enclosed.
/((\b|[@#])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["#world", "#"], ["#ruby", "#"]]. So now all the correct items are there, they're just located at the first element of each of the subarrays. The following snippet technically works:
input.scan(/((\b|[@#])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this to match and return the correct substrings with just the regular expression not requiring the collect post-processing?
You can just use the regular expression /[@#]?\b\w+\b/. That is, optionally match an @ or #, followed by a word boundary (in #ruby, that boundary would be between # and ruby; in a normal word it would also match at the start of the word) and a bunch of word characters.
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w+\b/)
# => ["Hello", "#world", "#ruby", "anotherString"]
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. You gave an example in a comment to a deleted answer to match only #ruby by using {3,4}:
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w{3,4}\b/)
# => ["#ruby"]
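Putting the answer together with the 3 to 12 character limit from the question might look like the sketch below. The sample input with an @ mention is an assumption for illustration, and ENTITY_PATTERN and entities are hypothetical names. The second pattern keeps the question's alternation but uses a non-capturing group (?:...), so scan returns plain strings instead of nested arrays:

```ruby
# Optional @ or # prefix, then a word of 3 to 12 word characters.
ENTITY_PATTERN = /[@#]?\b\w{3,12}\b/

# Returns every entity (plain word, @mention or #topic) found in the text.
def entities(text)
  text.scan(ENTITY_PATTERN)
end

input = "Hello @world, #ruby anotherString" # assumed sample input
p entities(input)
# => ["Hello", "@world", "#ruby"]

# Same result with the question's alternation and a non-capturing group.
p input.scan(/(?:\b|[@#])\w{3,12}\b/)
# => ["Hello", "@world", "#ruby"]
```

"anotherString" is 13 characters long, so the {3,12} quantifier leaves it out in both cases.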

What's the difference between scan and match on Ruby string

I am new to Ruby and have always used String.scan to search for the first occurrence of a number. It is kind of strange that the returned value is a nested array, but I just use [0][0] to get the values I want. (I am sure it has its purpose, just that I haven't used it yet.)
I just found out that there is a String.match method. And it seems to be more convenient because the returned array is not nested.
Here is an example of the two, first is scan:
>> 'a 1-night stay'.scan(/(a )?(\d*)[- ]night/i).to_a
=> [["a ", "1"]]
then is match
>> 'a 1-night stay'.match(/(a )?(\d*)[- ]night/i).to_a
=> ["a 1-night", "a ", "1"]
I have checked the API, but I can't really tell the two apart, as both are described as 'match the pattern'.
This question is, simply out of curiosity, about what scan can do that match can't, and vice versa. Is there any specific scenario that only one of them can handle? Is match inferior to scan?
Short answer: scan will return all matches. This doesn't make it superior, because if you only want the first match, str.match[2] reads much nicer than str.scan[0][1].
ruby-1.9.2-p290 :002 > 'a 1-night stay, a 2-night stay'.scan(/(a )?(\d*)[- ]night/i).to_a
=> [["a ", "1"], ["a ", "2"]]
ruby-1.9.2-p290 :004 > 'a 1-night stay, a 2-night stay'.match(/(a )?(\d*)[- ]night/i).to_a
=> ["a 1-night", "a ", "1"]
#scan returns everything that the Regex matches.
#match returns the first match as a MatchData object, which contains data held by special variables like $& (what was matched by the Regex; that's what's mapping to index 0), $1 (match 1), $2, et al.
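A short sketch of the practical difference, using the pattern from the question. The helper names first_night_count and all_night_counts are hypothetical. Note that match returns nil when nothing matches, while scan returns an empty array:

```ruby
NIGHTS = /(a )?(\d*)[- ]night/i

# match: only the first occurrence, as a MatchData (or nil).
def first_night_count(text)
  m = text.match(NIGHTS)
  m && m[2].to_i
end

# scan: every non-overlapping occurrence, one array of captures each.
def all_night_counts(text)
  text.scan(NIGHTS).map { |_article, digits| digits.to_i }
end

text = 'a 1-night stay, a 2-night stay'
p first_night_count(text)       # => 1
p all_night_counts(text)        # => [1, 2]
p first_night_count('day trip') # => nil
p all_night_counts('day trip')  # => []
```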
Previous answers state that scan will return every match from the string the method is called on, but this is not quite correct: overlapping matches are skipped.
Scan keeps track of an index and continues looking for subsequent matches after the last character of the previous match.
string = 'xoxoxo'
p string.scan('xo') # => ["xo", "xo", "xo"]
# so far so good but...
p string.scan('xox') # => ["xox"]
# if this returned EVERY instance of 'xox' it would include a substring
# starting at indices 0 and 2 but only one match is found
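If you do need the overlapping occurrences, one possible approach is to wrap the pattern in a lookahead. A lookahead consumes no characters, so scan moves forward one character at a time. The sketch below also shows a plain loop over String#index for the start positions; both helper names are hypothetical:

```ruby
# Wrap the substring in a zero-width lookahead with a capture group, so
# each match consumes nothing and overlapping occurrences are all found.
def overlapping_matches(text, substring)
  text.scan(/(?=(#{Regexp.escape(substring)}))/).flatten
end

# Collect the start index of every occurrence, overlapping or not.
def overlapping_indices(text, substring)
  indices = []
  pos = 0
  while (i = text.index(substring, pos))
    indices << i
    pos = i + 1 # restart one character after the previous hit
  end
  indices
end

p 'xoxoxo'.scan('xox')                 # => ["xox"]
p overlapping_matches('xoxoxo', 'xox') # => ["xox", "xox"]
p overlapping_indices('xoxoxo', 'xox') # => [0, 2]
```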

Must a gsub hash key be a string, not a regexp?

I want to do a sequence of gsubs against one string, so I utilized the fact that gsub can take a hash as the second argument. One thing I wanted to do with gsub is to convert a sequence of one or more space/tab into a single space, so I have something essentially as follows:
gsub(/[ \t]+/, {/[ \t]+/ => ' '})
In my actual code, the first argument is a union of the regexp I gave here, and the second argument includes more key-value pairs.
Now, when I apply this to a string, all of the spaces/tabs are deleted. I suppose this is because the match to the first argument is not regarded as matching the key /[ \t]+/ in the second argument (hash). Does the lookup in the second-argument hash only look for an exact string match, not a regexp match? If so, is there any way to get around it?
This is a related question. If you need to use the hash because many things have to be substituted, this might work:
list = Hash.new{|h,k|if /\s+/ =~ k then ' ' else k end}
list['foo'] = 'bar'
list['apple'] = 'banana'
p "appleabc\t \tabc apple foo".gsub(/\w+|\W+/,list)
#=> "appleabc abc banana bar"
p list
#=>{"foo"=>"bar", "apple"=>"banana"} no garbage
According to the docs, gsub with a hash as the second parameter only matches against literal strings:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
If you want to apply multiple patterns, you could work around it by creating a hash whose key/value pairs are the search => replacement pairs, iterating over the hash, and passing each pair to gsub. Because Ruby 1.9+ maintains the insertion order of the hash, you're guaranteed that the substitutions will occur in the order you want.
search_hash = {
'1' => 'one',
'too' => 'two',
/[\t ]+/ => ' '
}
str = "1, too,\t3 , four"
search_hash.each { |n,v| str.gsub!(n, v) }
str #=> "one, two, 3 , four"
If you just want the spaces/tabs replaced with one space, why not just specify that as the replacement, and omit the whole hash?
gsub(/[ \t]+/, ' ')
UPDATE: based on your comment, you can use the block syntax of gsub
gsub(/[ \t]+/) { |match| ' ' } # the block's return value replaces each match; compute it here
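Combining the block form with Regexp.union gives a single pass over the string: literal keys are looked up in a hash, and anything else that matched (here, a run of spaces/tabs) falls back to a single space. This is only a sketch of one way to do it; REPLACEMENTS, PATTERN and normalize are hypothetical names:

```ruby
# Literal search strings and their replacements.
REPLACEMENTS = { "1" => "one", "too" => "two" }.freeze

# One pattern that matches a run of spaces/tabs or any of the literal keys.
# Regexp.union escapes the string arguments for us.
PATTERN = Regexp.union(/[ \t]+/, *REPLACEMENTS.keys)

def normalize(text)
  # Known literals come from the hash; whitespace runs become one space.
  text.gsub(PATTERN) { |match| REPLACEMENTS.fetch(match, " ") }
end

p normalize("1, too,\t3 , four") # => "one, two, 3 , four"
```

Because everything happens in one gsub call, text produced by one replacement is never matched again by a later rule, which can happen when several gsub! calls run in sequence.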

Resources