How do I split a string on capitals unless preceded by a '+' - ruby

I have a CamelCased string, which I would like to split into individual words at the capitals, unless the capital is preceded by a '+':
Splitting on the caps is fairly simple in Ruby: s.split(/(?=[A-Z])/)
But I can't figure out how to add the "except after '+'" part.
For example:
s = "FooBashFizz+BuzzXBar"
p s.split(/(?=[A-Z])/)
=> ["Foo", "Bash", "Fizz+", "Buzz", "X", "Bar"]
desired:
=> ["Foo", "Bash", "Fizz+Buzz", "X", "Bar"]

Add a negative lookbehind at the start.
irb(main):001:0> s = "FooBashFizz+BuzzXBar"
=> "FooBashFizz+BuzzXBar"
irb(main):002:0> s.split(/(?<!\+)(?=[A-Z])/)
=> ["Foo", "Bash", "Fizz+Buzz", "X", "Bar"]
Explanation:
(?<!\+) Asserts that the preceding character is not a + symbol.
(?=[A-Z]) Asserts that the following character must be an uppercase letter.
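As a minimal sketch, the lookbehind split can be wrapped in a small helper. The method name split_camel is hypothetical, not something from the answer, and the pattern assumes a regex engine with lookbehind support (Ruby 1.9 or later).

```ruby
# Hypothetical helper: split before each capital letter,
# except where that capital directly follows a '+'.
def split_camel(str)
  str.split(/(?<!\+)(?=[A-Z])/)
end

p split_camel("FooBashFizz+BuzzXBar")  # the example from the question
p split_camel("A+B+CDone")             # chained '+' joins stay in one piece
```

Both assertions are zero-width, so split consumes no characters and every character of the input ends up in some piece.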

Alternative using String#scan. This also works in Ruby 1.8.
s = "FooBashFizz+BuzzXBar"
s.scan(/[A-Z][a-z]*(?:\+[A-Z][a-z]*)*/)
# => ["Foo", "Bash", "Fizz+Buzz", "X", "Bar"]
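The two approaches agree on the example string but can differ on other input. The sketch below assumes an input that starts with a lowercase run and contains a digit; the method names are made up for illustration.

```ruby
# split keeps every character; scan keeps only what the pattern matches.
def words_by_split(str)
  str.split(/(?<!\+)(?=[A-Z])/)
end

def words_by_scan(str)
  str.scan(/[A-Z][a-z]*(?:\+[A-Z][a-z]*)*/)
end

s = "fooBar2Baz"
p words_by_split(s)  # leading "foo" and the digit are preserved
p words_by_scan(s)   # anything outside [A-Z][a-z]* (and '+') is dropped
```

Which behavior is preferable depends on whether the input is guaranteed to be pure CamelCase.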

Related

Is there a version of Ruby's Regexp.match that responds to the order of the matches within the string?

I want to use regexes to check if a given string is composed of certain substrings.
For example, given the regular expression (anchored, so that it has to match the whole string)
> regex = /\A(?:(foo)|(bar)|(baz))*\z/
I can determine whether a given string matches the pattern:
> regex === "bazbar"
=> true
> regex === "qux"
=> false
But I want to know how to break the string into substrings. I can almost do this with
> regex.match("barbazfoo").captures
=> ["foo", "bar", "baz"]
But here they appear in the order in which I specified them within the regex. I want to return
["bar", "baz", "foo"]
In the order in which they appeared in the string.
You can use String#scan with a modified regular expression:
regex = /foo|bar|baz/
"barbazfoo".scan(regex)
# => ["bar", "baz", "foo"]
UPDATE according to OP's comment.
If some of the strings are substrings of others, order the alternatives so that the substrings go last (longest first); otherwise the shorter alternative wins:
"barfoo".scan(/ba|bar|foo/) # without ordering
# => ["ba", "foo"]
words = ['ba', 'bar', 'foo']
pattern = words.map { |word| Regexp.escape(word) }.sort_by { |x| -x.size }.join('|')
"barfoo".scan(Regexp.new(pattern))
# => ["bar", "foo"]
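Putting the pieces together, here is a rough sketch that tokenizes a string and also reports whether it is made up entirely of the given words. The method name tokenize is hypothetical, and Regexp.union is used in place of the manual escape-and-join above (it escapes plain strings itself). The scan is greedy, so this is a simple check, not an exhaustive search over all possible segmentations.

```ruby
# Hypothetical helper: return the words in the order they appear in str,
# or nil if the greedy tokenization does not cover the whole string.
def tokenize(str, words)
  # Longest alternatives first, so "bar" is tried before "ba".
  pattern = Regexp.union(words.sort_by { |w| -w.size })
  tokens = str.scan(pattern)
  tokens.join == str ? tokens : nil
end

p tokenize("barbazfoo", %w[foo bar baz])
p tokenize("barfoo", %w[ba bar foo])
p tokenize("qux", %w[foo bar baz])  # not composed of the given words
```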

How could I split string and keep the whitespaces, as well?

I did the following in Python:
s = 'This is a text'
re.split('(\W)', s)
# => ['This', ' ', 'is', ' ', 'a', ' ', 'text']
It worked just great. How do I do the same split in Ruby?
I've tried this, but it eats up my whitespace:
s = "This is a text"
s.split(/[\W]/)
# => ["This", "is", "a", "text"]
From the String#split documentation:
If pattern contains groups, the respective matches will be returned in
the array as well.
This works in Ruby the same as in Python. Square brackets specify character classes, not match groups:
"foo bar baz".split(/(\W)/)
# => ["foo", " ", "bar", " ", "baz"]
toro2k's answer is the most straightforward. Alternatively, use String#scan:
string.scan(/\w+|\W+/)
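The two suggestions give the same result for single spaces but differ when separators repeat. A small sketch, with hypothetical method names, to make the difference visible:

```ruby
# split with a capture group returns each separator character on its own
# and leaves an empty string between adjacent separators.
def split_keeping_separators(str)
  str.split(/(\W)/)
end

# scan with \w+|\W+ groups consecutive separator characters into one run.
def scan_runs(str)
  str.scan(/\w+|\W+/)
end

p split_keeping_separators("This is a text")
p split_keeping_separators("a  b")  # two spaces in a row
p scan_runs("a  b")
```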

How to split a string containing both delimiter and the escaped delimiter?

My string delimiter is ;. Delimiter is escaped in the string as \;. E.g.,
irb(main):018:0> s = "a;b;;d\\;e"
=> "a;b;;d\\;e"
irb(main):019:0> s.split(';')
=> ["a", "b", "", "d\\", "e"]
Could someone suggest a regex so that the output of split would be ["a", "b", "", "d\\;e"]? I'm using Ruby 1.8.7.
1.8.7 doesn't have negative lookbehind without Oniguruma (which may be compiled in).
1.9.3; yay:
> s = "a;b;c\\;d"
=> "a;b;c\\;d"
> s.split /(?<!\\);/
=> ["a", "b", "c\\;d"]
1.8.7 with Oniguruma doesn't offer a trivial split, but you can get match offsets and pull apart the substrings that way. I assume there's a better way to do this I'm not remembering:
> require 'oniguruma'
> re = Oniguruma::ORegexp.new "(?<!\\\\);"
> s = "hello;there\\;nope;yestho"
> re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds = re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds.collect {|md| md.offset}
=> [[5, 6], [17, 18]]
Other options include:
Splitting on ; and post-processing the results looking for trailing \\, or
Do a char-by-char loop and maintain some simple state and just split manually.
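The char-by-char option could look roughly like the sketch below. The method name split_escaped and its default arguments are assumptions for illustration. Unlike String#split, it keeps trailing empty fields, and unlike a single-character lookbehind, it treats a doubled backslash as an escaped backslash and not as an escape for the following delimiter.

```ruby
# Hypothetical manual splitter: walk the string once, tracking whether
# the previous character was an unconsumed escape character.
def split_escaped(str, delimiter = ';', escape = '\\')
  fields = [String.new]
  escaped = false
  str.each_char do |ch|
    if escaped
      fields.last << ch      # escaped character is taken literally
      escaped = false
    elsif ch == escape
      fields.last << ch      # keep the backslash, as in the desired output
      escaped = true
    elsif ch == delimiter
      fields << String.new   # unescaped delimiter starts a new field
    else
      fields.last << ch
    end
  end
  fields
end

p split_escaped("a;b;;d\\;e")
```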
As @dave-newton answered, you could use negative lookbehind, but that isn't supported in 1.8. An alternative that works in both 1.8 and 1.9 is to use String#scan instead of split, with a pattern that accepts either any character other than a semicolon or backslash, or any character prefixed by a backslash:
$ irb
>> RUBY_VERSION
=> "1.8.7"
>> s = "a;b;c\\;d"
=> "a;b;c\\;d"
>> s.scan /(?:[^;\\]|\\.)+/
=> ["a", "b", "c\\;d"]
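One difference worth checking against the original question: because the scan pattern needs at least one character per match, it cannot return an empty field, while the lookbehind split can. A short comparison, with hypothetical method names, using the string from the question:

```ruby
# scan: each match needs one or more characters, so ";;" yields no "".
def fields_by_scan(str)
  str.scan(/(?:[^;\\]|\\.)+/)
end

# split with negative lookbehind (Ruby 1.9+): keeps the empty field.
def fields_by_split(str)
  str.split(/(?<!\\);/)
end

s = "a;b;;d\\;e"
p fields_by_scan(s)
p fields_by_split(s)
```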

What is the difference between %w and %W

I'm looking at the documentation for Ruby and I'm confused about %w() versus %W() (the latter with an uppercase W). What is the difference between the two? Can you point me to some documentation?
When capitalized, the array is constructed from strings that are interpolated, as would happen in a double-quoted string; when lowercased, it is constructed from strings that are not interpolated, as would happen in a single-quoted string. For example:
irb(main):001:0> foo = "bar"
=> "bar"
irb(main):002:0> %w(#{foo} bar baz)
=> ["\#{foo}", "bar", "baz"]
irb(main):003:0> %W(#{foo} bar baz)
=> ["bar", "bar", "baz"]
irb(main):004:0> ^D
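The same single-quoted versus double-quoted rule applies to escape sequences, not only to interpolation. A small sketch (method names are hypothetical) that runs outside irb:

```ruby
# %w behaves like single quotes: no interpolation, \t stays two characters.
def literal_words
  %w(#{foo} a\tb)
end

# %W behaves like double quotes: #{...} is interpolated, \t becomes a tab.
def interpolated_words(foo)
  %W(#{foo} a\tb)
end

p literal_words
p interpolated_words("bar")
```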

Putting an enumeration with spaces in a Rails collection

irb(main):001:0> t = %w{this is a test}
=> ["this", "is", "a", "test"]
irb(main):002:0> t.size
=> 4
irb(main):003:0> t = %w{"this is" a test}
=> ["\"this", "is\"", "a", "test"]
irb(main):004:0> t.size
=> 4
In the end I expected t.size to be 3.
As suggested, each space has to be escaped, which turns out to be a lot of work. What other options are there? I have a list of about 30 words that I need to put in a collection because I am showing them as checkboxes using simple_form.
Why not just use a normal array so no one has to visually parse all the escaping to figure out what's going on? This is pretty clear:
t = [
'this is',
'a',
'test'
]
and the people maintaining your code won't hate you for using %w{} when it isn't appropriate or when they mess things up because they didn't see your escaped whitespace.
You need to escape the space with a '\', like t = %w{this\ is a test}, if you don't want that space to act as a separator.
Escape the space using \:
%w{this\ is a test}
You can escape the space %w{this\ is a test} to get ['this is', 'a', 'test'], but in general I wouldn't use %w unless the intention is to split on whitespace.
As others have pointed out, use the %w{} construct when spaces are the separator for the words. If you have items that contain spaces and still want to use the construct, you can do:
> %w{a test here}.unshift("This is")
=> ["This is", "a", "test", "here"]
require 'csv'
str = '"this is" a test'
p CSV.parse_line(str, col_sep: ' ')
#=> ["this is", "a", "test"]
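To compare the options side by side, the sketch below builds the same three-element array three ways. The first two come from the answers above; the third uses Shellwords from the standard library, which is not mentioned in the thread and is shown only as one more possibility for quoted input. The method names are hypothetical.

```ruby
require 'shellwords'

# Option 1: %w with the space escaped.
def escaped_words
  %w{this\ is a test}
end

# Option 2: a plain array literal, no escaping to overlook.
def explicit_words
  ['this is', 'a', 'test']
end

# Option 3 (not from the thread): shell-style parsing of a quoted string.
def shell_words
  Shellwords.split('"this is" a test')
end

p escaped_words.size
p escaped_words == explicit_words
p shell_words == explicit_words
```

For a list of about 30 entries, the plain array is probably the easiest to read and maintain.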
