How to split a string containing both delimiter and the escaped delimiter? - ruby

My string delimiter is ;. Delimiter is escaped in the string as \;. E.g.,
irb(main):018:0> s = "a;b;;d\\;e"
=> "a;b;;d\\;e"
irb(main):019:0> s.split(';')
=> ["a", "b", "", "d\\", "e"]
Could someone suggest a regex so that the output of split would be ["a", "b", "", "d\\;e"]? I'm using Ruby 1.8.7.

1.8.7 doesn't have negative lookbehind without Oniguruma (which may be compiled in).
1.9.3; yay:
> s = "a;b;c\\;d"
=> "a;b;c\\;d"
> s.split /(?<!\\);/
=> ["a", "b", "c\\;d"]
1.8.7 with Oniguruma doesn't offer a trivial split, but you can get match offsets and pull apart the substrings that way. I assume there's a better way to do this I'm not remembering:
> require 'oniguruma'
> re = Oniguruma::ORegexp.new "(?<!\\\\);"
> s = "hello;there\\;nope;yestho"
> re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds = re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds.collect {|md| md.offset}
=> [[5, 6], [17, 18]]
Other options include:
Splitting on ; and post-processing the results looking for trailing \\, or
Do a char-by-char loop and maintain some simple state and just split manually.
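The char-by-char option could be sketched roughly as below. This is only an illustration: split_unescaped is a made-up helper name, not a core method, and it assumes a single-character delimiter and escape character. Unlike String#split it keeps trailing empty fields.

```ruby
# Hypothetical helper (not part of Ruby): split on delim unless the
# delimiter is preceded by the escape character.
def split_unescaped(str, delim = ';', escape = '\\')
  fields = [String.new]   # start with one empty, mutable field
  escaped = false         # true when the previous char was the escape char
  str.each_char do |ch|
    if escaped
      fields.last << ch   # keep the escaped char, whatever it is
      escaped = false
    elsif ch == escape
      fields.last << ch   # keep the backslash in the output, like the regex answers
      escaped = true
    elsif ch == delim
      fields << String.new  # unescaped delimiter: start a new field
    else
      fields.last << ch
    end
  end
  fields
end

p split_unescaped("a;b;;d\\;e")
# => ["a", "b", "", "d\\;e"]
```

Since it only uses each_char and <<, it should not depend on lookbehind support at all.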

As @dave-newton answered, you could use negative lookbehind, but that isn't supported in 1.8. An alternative that works in both 1.8 and 1.9 is to use String#scan instead of split, with a pattern that accepts either any character other than a semicolon or backslash, or any character prefixed by a backslash:
$ irb
>> RUBY_VERSION
=> "1.8.7"
>> s = "a;b;c\\;d"
=> "a;b;c\\;d"
>> s.scan /(?:[^;\\]|\\.)+/
=> ["a", "b", "c\\;d"]
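One caveat with the scan pattern: it needs at least one character per match, so empty fields (like the one in the question's "a;b;;d\\;e") are dropped. A small sketch showing that, plus an optional unescape step that is my own addition and not part of the answer:

```ruby
pattern = /(?:[^;\\]|\\.)+/

# The question's string has an empty field between "b" and "d\;e".
# scan skips it, because the pattern cannot match an empty string.
with_empty = "a;b;;d\\;e".scan(pattern)
p with_empty
# => ["a", "b", "d\\;e"]

# Optional extra step (an assumption about what you want afterwards):
# drop the backslash in front of any escaped character.
unescaped = "a;b;c\\;d".scan(pattern).map { |field| field.gsub(/\\(.)/, '\1') }
p unescaped
# => ["a", "b", "c;d"]
```

If empty fields matter, the lookbehind split or a manual loop is probably the safer choice.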

Related

How do I split a string on capitals unless preceded by a '+'

I have a CamelCased string, which I would like to split into individual words at the capitals, unless the capital is preceded by a '+':
Splitting on the caps is fairly simple in Ruby: s.split(/(?=[A-Z])/)
But I can't figure out how to add the "except after '+'" part.
For example:
s = "FooBashFizz+BuzzXBar"
p s.split(/(?=[A-Z])/)
=> ["Foo", "Bash", "Fizz+", "Buzz", "X", "Bar"]
desired:
=> ["Foo", "Bash", "Fizz+Buzz", "X", "Bar"]
Add a negative lookbehind at the start.
irb(main):001:0> s = "FooBashFizz+BuzzXBar"
=> "FooBashFizz+BuzzXBar"
irb(main):002:0> s.split(/(?<!\+)(?=[A-Z])/)
=> ["Foo", "Bash", "Fizz+Buzz", "X", "Bar"]
Explanation:
(?<!\+) Asserts that the preceding character is not a + symbol.
(?=[A-Z]) Asserts that the following character must be an uppercase letter.
Alternative using String#scan. This also works in Ruby 1.8.
s = "FooBashFizz+BuzzXBar"
s.scan(/[A-Z][a-z]*(?:\+[A-Z][a-z]*)*/)
# => ["Foo", "Bash", "Fizz+Buzz", "X", "Bar"]

Is there a version of Ruby's Regexp.match that responds to the order of the matches within the string?

I want to use regexes to check if a given string is composed of certain substrings.
For example, given the regular expression
> regex = /(?:(foo)|(bar)|(baz))*/
I can determine whether a given string matches the pattern:
> regex === "bazbar"
=> true
> regex === "qux"
=> false
But I want to know how to break the string into substrings. I can almost do this with
> regex.match("barbazfoo").captures
=> ["foo", "bar", "baz"]
But here they appear in the order in which I specified them within the regex. I want to return
["bar", "baz", "foo"]
In the order in which they appeared in the string.
You can use String#scan with a modified regular expression:
regex = /foo|bar|baz/
"barbazfoo".scan(regex)
# => ["bar", "baz", "foo"]
UPDATE according to OP's comment.
If some of the strings are substrings of others, you need to order them so that the shorter substrings go last.
"barfoo".scan(/ba|bar|foo/) # without ordering
# => ["ba", "foo"]
words = ['ba', 'bar', 'foo']
pattern = words.map { |word| Regexp.escape(word) }.sort_by { |x| -x.size }.join('|')
"barfoo".scan(Regexp.new(pattern))
# => ["bar", "foo"]
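The same longest-first idea can also be written with Regexp.union, which escapes each string and joins them with | for you. A short sketch, reusing the words list from above:

```ruby
words = ['ba', 'bar', 'foo']

# Sort longest first so "bar" is tried before its prefix "ba",
# then let Regexp.union escape and join the alternatives.
regex = Regexp.union(words.sort_by { |word| -word.size })

p "barfoo".scan(regex)
# => ["bar", "foo"]
```

The order among words of equal length does not matter here, only that no word comes after one of its own prefixes.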

Split Unicode entities by graphemes

"d̪".chars.to_a
gives me
["d"," ̪"]
How do I get Ruby to split it by graphemes?
["d̪"]
Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you just want to iterate/enumerate without necessarily creating an array.
In Ruby 2.0 or above you can use str.scan /\X/
> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]
# Let's get crazy:
> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
> str.length
=> 75
> str.scan(/\X/).length
=> 6
If you want to match the grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:
> "d̪".split /(?=\X)/
=> ["d̪"]
ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason:
ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }
The following code should work in Ruby 2.5:
"d̪".grapheme_clusters # => ["d̪"]
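For the iterating variant mentioned earlier, each_grapheme_cluster, a minimal sketch (assuming Ruby 2.5 or later; the string is written with \u escapes so the combining mark is visible in the source):

```ruby
# "d" followed by U+032A (combining bridge below), twice:
# four code points, but only two graphemes.
s = "d\u032Ad\u032A"

clusters = []
s.each_grapheme_cluster { |g| clusters << g }

p s.length         # => 4
p clusters.length  # => 2
```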
Use Unicode::text_elements from unicode.gem which is documented at http://www.yoshidam.net/unicode.txt.
irb(main):001:0> require 'unicode'
=> true
irb(main):006:0> s = "abčd̪é"
=> "abčd̪é"
irb(main):007:0> s.chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):009:0> Unicode.nfc(s).chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):010:0> Unicode.nfd(s).chars.to_a
=> ["a", "b", "c", "̌", "d", "̪", "e", "́"]
irb(main):017:0> Unicode.text_elements(s)
=> ["a", "b", "č", "d̪", "é"]
Ruby 2.0:
str = "d̪"
char = str[/\p{M}/]
other = str[/\w/]

putting enumeration with spaces in rails collection

irb(main):001:0> t = %w{this is a test}
=> ["this", "is", "a", "test"]
irb(main):002:0> t.size
=> 4
irb(main):003:0> t = %w{"this is" a test}
=> ["\"this", "is\"", "a", "test"]
irb(main):004:0> t.size
=> 4
In the end I expected t.size to be 3.
As suggested, each space has to be escaped ...which turns out to be a lot of work. What other options are there? I have a list of about 30 words that I need to put in a collection because I am showing them as checkboxes using simple_form
Why not just use a normal array so no one has to visually parse all the escaping to figure out what's going on? This is pretty clear:
t = [
'this is',
'a',
'test'
]
and the people maintaining your code won't hate you for using %w{} when it isn't appropriate or when they mess things up because they didn't see your escaped whitespace.
You need to escape the space with a \, like t = %w{this\ is a test}, if you don't want that space to be a splitter.
Escape the space using \:
%w{this\ is a test}
You can escape the space %w{this\ is a test} to get ['this is', 'a', 'test'], but in general I wouldn't use %w unless the intention is to split on whitespace.
As others have pointed out use the %w{} construct when spaces are the separator for the words. If you have items that must be quoted and still want to use the construct you can do:
> %w{a test here}.unshift("This is")
=> ["This is", "a", "test", "here"]
require 'csv'
str = '"this is" a test'
p CSV.parse_line(str, :col_sep => ' ')
#=> ["this is", "a", "test"]

Why isn't there a String#shift()?

I'm working my way through Project Euler, and ran into a slightly surprising omission: There is no String#shift, unshift, push, or pop. I had assumed a String was considered a "sequential" object like an Array, since they share the ability to be indexed and iterated through, and that this would include the ability to easily change the beginning and ends of the object.
I know there are ways to create the same effects, but is there a specific reason that String does not have these methods?
Strings don't act as an enumerable object as of 1.9, because it's considered too confusing to decide what it'd be a list of:
A list of characters / codepoints?
A list of bytes?
A list of lines?
Not being a Ruby contributor, I can't speak to their design goals, but from experience, I don't think that strings are regarded as 'sequential' objects; they're mutable in ways that suggest sequential behaviour, but most of the time they're treated atomically.
Case in point: in Ruby 1.9, String no longer mixes in Enumerable.
>> mystring = "abcdefgh"
=> "abcdefgh"
>> myarray = mystring.split("")
=> ["a", "b", "c", "d", "e", "f", "g", "h"]
>> myarray.pop
=> "h"
>> mystring = myarray.join
=> "abcdefg"
This should do it, though you would have to convert it to an array and then back.
UPDATE:
Use String#chop! and String#<<:
>> s = "abc"
=> "abc"
>> s.chop!
=> "ab"
>> s
=> "ab"
>> s<<"def"
=> "abdef"
>> s
=> "abdef"
Well at least in 1.9.2, you can deal with a string like an array.
ruby-1.9.2-p290 :001 > "awesome"[3..-1]
=> "some"
So if you want to do a sort of character left shift, just use [1..-1]
ruby-1.9.2-p290 :003 > "fooh!"[1..-1]
=> "ooh!"
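If you want the in-place behaviour of shift/pop/unshift/push rather than a new string, existing String methods get fairly close. A rough sketch (assuming Ruby 1.9 or later, where slice! with an integer index returns a one-character string):

```ruby
s = String.new("hello")   # explicit mutable string

first = s.slice!(0)    # like Array#shift: removes and returns "h"
last  = s.slice!(-1)   # like Array#pop: removes and returns "o"
s.insert(0, "H")       # like Array#unshift
s << "!"               # like Array#push

p first  # => "h"
p last   # => "o"
p s      # => "Hell!"
```

These work on characters, which is one answer to the "list of what?" question above; bytes or lines would need different calls.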
