How to split a string with accented characters in Ruby

Currently I get:
"mɑ̃ʒe".split('')
# => ["m", "ɑ", "̃", "ʒ", "e"]
I would like to get this result
"mɑ̃ʒe".split('')
# => ["m", "ɑ̃", "ʒ", "e"]

Use String#each_grapheme_cluster instead. For example:
"mɑ̃ʒe".each_grapheme_cluster.to_a
#=> ["m", "ɑ̃", "ʒ", "e"]
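To make the difference visible regardless of how the text is rendered, here is a small sketch that spells the nasalised vowel with explicit escapes (U+0251 followed by the combining tilde U+0303). The variable names are only illustrative, and grapheme_clusters assumes Ruby 2.5 or later.

```ruby
# "ɑ̃" written as base letter U+0251 plus combining tilde U+0303
word = "m\u0251\u0303\u0292e"

code_points = word.chars              # one element per code point
clusters    = word.grapheme_clusters  # combining marks stay attached

puts code_points.length  # 5
puts clusters.length     # 4
```

The second element of clusters is the two-code-point string "\u0251\u0303", which is what a reader perceives as a single character.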

Related

Zip all array values of hash

I'd like to zip all the array values of a hash. I know there's a way to zip arrays together. I'd like to do that with the values of my hash below.
current_hash = {:a=>["k", "r", "u"],
:b=>["e", " ", "l"],
:c=>["d", "o", "w"],
:d=>["e", "h"]
}
desired_outcome = "keder ohulw"
I have included my desired outcome above.
current_hash.values.then { |first, *rest| first.zip(*rest) }.flatten.compact.join
An unfortunate thing with Ruby zip is that the first enumerable needs to be the receiver, and the others need to be parameters. Here, I use then, parameter deconstruction and splat to separate the first enumerable from the rest. flatten gets rid of the column arrays, compact gets rid of the nil (though it's not really necessary as join will ignore it), and join turns the array into the string.
Note that Ruby zip will stop at the length of the receiver; so if :a is shorter than the others, you will likely have a surprising result. If that is a concern, please update with an example that reflects that scenario, and the desired outcome.
Here I'm fleshing out @Amadan's remark below the horizontal line in his answer. Suppose:
current_hash = { a:["k","r"], b:["e"," ","l"], c:["d","o","w"], d:["e", "h"] }
and you wished to return "keder ohlw". If you made ["k","r"] and [["e"," ","l"], ["d","o","w"], ["e", "h"]] zip's receiver and argument, respectively, you would get "keder oh", which omits "l" and "w". (See Array#zip, especially the 3rd paragraph.)
To include those strings you would need to fill out ["k","r"] with nils to make it as long as the longest value, or make zip's receiver an array of nils of the same length. The latter approach can be implemented as follows:
vals = current_hash.values
#=> [["k", "r"], ["e", " ", "l"], ["d", "o", "w"], ["e", "h"]]
([nil]*vals.map(&:size).max).zip(*vals).flatten.compact.join
#=> "keder ohlw"
Note:
a = [nil]*vals.map(&:size).max
#=> [nil, nil, nil]
and
a.zip(*vals)
#=> [[nil, "k", "e", "d", "e"],
# [nil, "r", " ", "o", "h"],
# [nil, nil, "l", "w", nil]]
One could alternatively use Array#transpose rather than zip.
vals = current_hash.values
idx = (0..vals.map(&:size).max-1).to_a
#=> [0, 1, 2]
vals.map { |a| a.values_at(*idx) }.transpose.flatten.compact.join
#=> "keder ohlw"
See Array#values_at. Note:
a = vals.map { |a| a.values_at(*idx) }
#=> [["k", "r", nil],
# ["e", " ", "l"],
# ["d", "o", "w"],
# ["e", "h", nil]]
a.transpose
#=> [["k", "e", "d", "e"],
# ["r", " ", "o", "h"],
# [nil, "l", "w", nil]]
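The padding idea above can also be wrapped in a small method. This is only a sketch: the name interleave is hypothetical, and it assumes the values themselves never contain nil, since compact would drop those as well.

```ruby
# Hypothetical helper: read arrays of unequal length column by column.
def interleave(arrays)
  width = arrays.map(&:size).max.to_i   # nil.to_i is 0 when arrays is empty
  # Pad every array with nils so that transpose sees a rectangle.
  padded = arrays.map { |arr| arr + [nil] * (width - arr.size) }
  padded.transpose.flatten.compact
end

current_hash = { a: ["k", "r"], b: ["e", " ", "l"], c: ["d", "o", "w"], d: ["e", "h"] }
result = interleave(current_hash.values).join
puts result  # keder ohlw
```

Padding explicitly avoids the problem of zip stopping at the length of its receiver, because no array plays the receiver role here.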

Split string by regex in Ruby

I need to split a string by commas that are outside brackets. I have this string:
'a,b,c,d[a,b,c[a,b]],e'
and my split needs to return:
['a', 'b', 'c', 'd[a,b,c[a,b]]', 'e']
How can I do that?
'a,b,c,d[a,b,c[a,b]],e'
.scan(/(?:\[[^\]]*\]|[^,])+/)
# => ["a", "b", "c", "d[a,b,c[a,b]]", "e"]
'a,[a][b],e'
.scan(/(?:\[[^\]]*\]|[^,])+/)
# => ["a", "[a][b]", "e"]
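The pattern above works for the sample input, but it does not track nesting depth: for an input such as 'x[a[b],c],y' the comma after the inner "]" would be treated as a separator. If arbitrarily nested brackets matter, one option is to count depth by hand. The method name split_outside_brackets is hypothetical, and this sketch keeps empty fields, unlike String#split.

```ruby
# Hypothetical helper: split on commas that are not inside [ ... ],
# tracking nesting depth instead of relying on a regex.
def split_outside_brackets(str)
  parts = [+""]   # +"" gives an unfrozen, mutable string
  depth = 0
  str.each_char do |ch|
    case ch
    when "[" then depth += 1
    when "]" then depth -= 1 if depth > 0
    end
    if ch == "," && depth.zero?
      parts << +""        # top-level comma: start a new field
    else
      parts.last << ch    # everything else belongs to the current field
    end
  end
  parts
end

p split_outside_brackets('a,b,c,d[a,b,c[a,b]],e')
p split_outside_brackets('x[a[b],c],y')
```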

How do I split on multiple conditions?

With Ruby, how do I split on either one of two conditions -- whether there are 3 or more spaces or a tab character? I tried this:
2.4.0 :003 > line = "a\tb\tc"
=> "a\tb\tc"
2.4.0 :004 > line.split(/([[:space:]][[:space:]][[:space:]]+|\t)/)
=> ["a", "\t", "b", "\t", "c"]
but as you can see, the tab character itself is getting included in my results. The results should be
["a", "b", "c"]
What about just split?
p "a\tb\tc".split
# ["a", "b", "c"]
p "a\tb\tc\t\tc\t\t\t\t\t\t\tc\ts\ts\tt".split
# ["a", "b", "c", "c", "c", "s", "s", "t"]
Plain split breaks on every run of whitespace, though, including single spaces. To split only on three or more whitespace characters or on tabs, this might work:
p "a\tb\tc\t\tc\t\t t\t\tc\ts\ts\tt".split(/\s{3,}|\t+/)
# => ["a", "b", "c", "c", "t", "c", "s", "s", "t"]
line = "aa bb   cc\tdd"
line.split(/\p{Space}{3,}|\t+/)
#⇒ ["aa bb", "cc", "dd"]
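The tab appeared in the question's result because of the parentheses in the pattern: String#split includes the text matched by capturing groups in the returned array. A small sketch comparing a capturing group with a non-capturing one (the variable names are illustrative, and the spaces are built with " " * n so that they stay visible):

```ruby
# "a", tab, "b", three spaces, "c", two spaces, "d"
line = "a\tb" + " " * 3 + "c" + " " * 2 + "d"

with_capture    = line.split(/(\s{3,}|\t)/)    # separators are kept
without_capture = line.split(/(?:\s{3,}|\t)/)  # separators are dropped

p with_capture
p without_capture
```

Dropping the parentheses altogether, as in the answers above, has the same effect as the non-capturing group.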

Splitting a string into an array ruby

I am trying to turn this string
"a,bc,c"
into this array..
["a", "b", "c"]
I've used split on the comma & iterated through it but I'd like to find a cleaner way.
Thanks!
I would use the #scan and #uniq methods.
"a, bc,c".scan(/[a-z]/).uniq
# => ["a", "b", "c"]
Here we go, one option:
"a, bc,c".gsub(/\W+/, '').chars.uniq
# => ["a", "b", "c"]

Split Unicode entities by graphemes

"d̪".chars.to_a
gives me
["d"," ̪"]
How do I get Ruby to split it by graphemes?
["d̪"]
Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you just want to iterate/enumerate without necessarily creating an array.
In Ruby 2.0 or above you can use str.scan /\X/
> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]
# Let's get crazy:
> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
> str.length
=> 75
> str.scan(/\X/).length
=> 6
If you want to match the grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:
> "d̪".split /(?=\X)/
=> ["d̪"]
ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason:
ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }
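As a quick cross-check of the approaches mentioned in this answer, the sketch below writes the combining bridge below as an explicit escape (U+032A) and compares scan(/\X/) with the Ruby 2.5+ methods. The variable names are illustrative.

```ruby
str = "d\u032Ad\u032Ad\u032A"   # three times "d" + combining bridge below

# each_grapheme_cluster without a block returns an enumerator,
# so counting does not require building an intermediate array.
count    = str.each_grapheme_cluster.count
via_scan = str.scan(/\X/)

puts str.length                         # 6 code points
puts count                              # 3 grapheme clusters
puts via_scan == str.grapheme_clusters  # true
```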
The following code should work in Ruby 2.5:
"d̪".grapheme_clusters # => ["d̪"]
Use Unicode::text_elements from unicode.gem which is documented at http://www.yoshidam.net/unicode.txt.
irb(main):001:0> require 'unicode'
=> true
irb(main):006:0> s = "abčd̪é"
=> "abčd̪é"
irb(main):007:0> s.chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):009:0> Unicode.nfc(s).chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):010:0> Unicode.nfd(s).chars.to_a
=> ["a", "b", "c", "̌", "d", "̪", "e", "́"]
irb(main):017:0> Unicode.text_elements(s)
=> ["a", "b", "č", "d̪", "é"]
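The normalization results above can be reproduced without the gem, assuming Ruby 2.2 or later, where String#unicode_normalize is built in. The sketch below shows why NFC alone is not enough: it composes "e" plus an acute accent into one code point, but there is no precomposed code point for "d" plus bridge below, so that pair only stays together when splitting by grapheme clusters (Ruby 2.5+).

```ruby
e_acute = "e\u0301"    # "e" + combining acute accent
d_dental = "d\u032A"   # "d" + combining bridge below

puts e_acute.chars.length                           # 2
puts e_acute.unicode_normalize(:nfc).chars.length   # 1 (U+00E9)

puts d_dental.unicode_normalize(:nfc).chars.length  # 2, nothing to compose into
puts d_dental.grapheme_clusters.length              # 1
```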
In Ruby 2.0:
str = "d̪"
char = str[/\p{M}/]   # the combining mark on its own
other = str[/\w/]     # the base letter "d"
