Split Unicode entities by graphemes - ruby

"d̪".chars.to_a
gives me
["d"," ̪"]
How do I get Ruby to split it by graphemes?
["d̪"]

Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you just want to iterate/enumerate without necessarily creating an array.
In Ruby 2.0 or above you can use str.scan /\X/
> "d̪".scan /\X/
=> ["d̪"]
> "d̪d̪d̪".scan /\X/
=> ["d̪", "d̪", "d̪"]
# Let's get crazy:
> str = 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'
> str.length
=> 75
> str.scan(/\X/).length
=> 6
If you want to match the grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:
> "d̪".split /(?=\X)/
=> ["d̪"]
ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason:
ActiveSupport::Multibyte::Unicode.unpack_graphemes("d̪").map { |codes| codes.pack("U*") }

The following code should work in Ruby 2.5:
"d̪".grapheme_clusters # => ["d̪"]

Use Unicode::text_elements from unicode.gem which is documented at http://www.yoshidam.net/unicode.txt.
irb(main):001:0> require 'unicode'
=> true
irb(main):006:0> s = "abčd̪é"
=> "abčd̪é"
irb(main):007:0> s.chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):009:0> Unicode.nfc(s).chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):010:0> Unicode.nfd(s).chars.to_a
=> ["a", "b", "c", "̌", "d", "̪", "e", "́"]
irb(main):017:0> Unicode.text_elements(s)
=> ["a", "b", "č", "d̪", "é"]
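The transcript above shows that NFC normalization does not merge "d" with its combining mark. A minimal sketch of why, using the standard library's String#unicode_normalize (Ruby 2.2+) rather than unicode.gem, which is a substitution on my part: NFC can only compose a sequence when Unicode defines a precomposed code point for it, and there is none for "d" plus the dental diacritic.

```ruby
# Assumes Ruby 2.5+ (unicode_normalize needs 2.2+, grapheme_clusters needs 2.5+).
e_acute  = "e\u0301"  # "e" + COMBINING ACUTE ACCENT
d_dental = "d\u032A"  # "d" + COMBINING BRIDGE BELOW

composed_e = e_acute.unicode_normalize(:nfc)   # a precomposed "é" (U+00E9) exists
composed_d = d_dental.unicode_normalize(:nfc)  # no precomposed form exists

puts composed_e.length                  # 1
puts composed_d.length                  # 2
puts d_dental.grapheme_clusters.length  # 1
```

So normalization alone cannot be relied on for splitting by graphemes; grapheme segmentation is a separate operation.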

Ruby2.0
str = "d̪"
char = str[/\p{M}/]
other = str[/\w/]

Related

How to split string with accented characters in ruby

Currently I got :
"mɑ̃ʒe".split('')
# => ["m", "ɑ", "̃", "ʒ", "e"]
I would like to get this result
"mɑ̃ʒe".split('')
# => ["m", "ã", "ʒ", "e"]
Use String#each_grapheme_cluster instead. For example:
"mɑ̃ʒe".each_grapheme_cluster.to_a
#=> ["m", "ɑ̃", "ʒ", "e"]
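For comparison, here is a small sketch of the difference between code points and grapheme clusters. It assumes Ruby 2.5 or later, and the sample string is my own, written with escapes so the combining mark is visible in the source.

```ruby
# "d\u032A" is "d" followed by COMBINING BRIDGE BELOW (the dental diacritic).
str = "d\u032A" * 3

code_points = str.chars              # one element per code point
clusters    = str.grapheme_clusters  # one element per user-perceived character

# Without a block, each_grapheme_cluster returns an Enumerator,
# so it can be chained without building an intermediate array first.
lengths = str.each_grapheme_cluster.map(&:length)

puts code_points.length  # 6
puts clusters.length     # 3
p lengths                # [2, 2, 2]
```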

Ruby non consistent results with scanned string's length

I may not be seeing the whole picture here, but I am getting inconsistent results from a calculation. I am trying to solve the run-length encoding problem, so that an input string like "AAABBAAACCCAA" is encoded as "3A2B3A3C2A". The function is:
def encode(input)
  res = ""
  input.scan(/(.)\1*/i) do |match|
    res << input[/(?<bes>#{match}+)/, "bes"].length.to_s << match[0].to_s
  end
  res
end
The results I am getting are:
irb(main):049:0> input = "AAABBBCCCDDD"
=> "AAABBBCCCDDD"
irb(main):050:0> encode(input)
(a) => "3A3B3C3D"
irb(main):051:0> input = "AAABBBCCCAAA"
=> "AAABBBCCCAAA"
irb(main):052:0> encode(input)
(b) => "3A3B3C3A"
irb(main):053:0> input = "AAABBBCCAAA"
=> "AAABBBCCAAA"
irb(main):054:0> encode(input)
(c) => "3A3B2C3A"
irb(main):055:0> input = "AAABBBCCAAAA"
=> "AAABBBCCAAAA"
irb(main):056:0> encode(input)
(d) => "3A3B2C3A"
irb(main):057:0> input = 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB'
=> "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB"
irb(main):058:0> encode(input)
(e) => "12W1B12W1B12W1B"
As you can see, results (a) through (c) are correct, but results (d) and (e) are missing some repetitions, so the encoded output is several characters short. Can you give me a hint as to where to look? (I am learning to use 'pry' right now.)
Regular expressions are great, but they're not the golden hammer for every problem.
str = "AAABBAAACCCAA"
str.chars.chunk_while { |i, j| i == j }.map { |a| "#{a.size}#{a.first}" }.join
Breaking down what it does:
str = "AAABBAAACCCAA"
str.chars # => ["A", "A", "A", "B", "B", "A", "A", "A", "C", "C", "C", "A", "A"]
.chunk_while { |i, j| i == j } # => #<Enumerator: #<Enumerator::Generator:0x007fc1998ac020>:each>
.to_a # => [["A", "A", "A"], ["B", "B"], ["A", "A", "A"], ["C", "C", "C"], ["A", "A"]]
.map { |a| "#{a.size}#{a.first}" } # => ["3A", "2B", "3A", "3C", "2A"]
.join # => "3A2B3A3C2A"
to_a is there for illustration, but isn't necessary:
str = "AAABBAAACCCAA"
str.chars
.chunk_while { |i, j| i == j }
.map { |a| "#{a.size}#{a.first}" }
.join # => "3A2B3A3C2A"
How do you find out about methods such as Array#chunk_while? I am using Ruby 2.3.1 but cannot find it in the API docs. Where is the complete list of all the available methods? Certainly not at ruby-doc.org/core-2.3.1/Array.html.
Well, this is off-topic to the question but it's useful information to know:
Remember that Array includes the Enumerable module, which contains chunk_while. Use the search functionality of http://ruby-doc.org to find where things live. Also, get familiar with using ri at the command line, and try running gem server at the command-line to get the help for all the gems you've installed.
If you look at the Array documentation page, on the left you can see that Array has a parent class of Object, so it'll have the methods from Object, and that it also includes the Enumerable module, so it'll also pull in whatever is implemented in Enumerable.
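You can also ask Ruby itself where a method lives instead of searching the docs. A quick sketch using plain reflection, assuming Ruby 2.3 or later so that chunk_while exists:

```ruby
# Which module actually defines Array#chunk_while?
owner = Array.instance_method(:chunk_while).owner

# Is it defined directly on Array, or only reachable through an included module?
defined_on_array = Array.instance_methods(false).include?(:chunk_while)

puts owner                       # Enumerable
puts defined_on_array            # false
puts Array.include?(Enumerable)  # true
```

Array.ancestors lists every class and module that method lookup will search, in order.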
Your code only ever counts the first run of each character, because input[/.../] always returns the first match in the whole string, not the run that scan is currently visiting. Instead, do the replacement with gsub and a block; the block receives each matched run as a string, so you can build the output from it directly:
def encode(input)
  input.gsub(/(.)\1*/) { |m| m.length.to_s << m[0] }
end
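To see why the version in the question miscounts, here is a small sketch (the variable names are mine): String#[] with a regex returns the first match in the string, so a later, longer run of the same character is never measured.

```ruby
input = "AAABBBCCAAAA"

# String#[] always returns the FIRST match, wherever it sits in the string.
first_a_run = input[/A+/]

# Capturing each run as it is scanned keeps the later, longer run intact.
# Group 1 is the whole run, group 2 is the repeated character.
runs = input.scan(/((.)\2*)/).map(&:first)

encoded = runs.map { |run| "#{run.length}#{run[0]}" }.join

puts first_a_run  # AAA
p runs            # ["AAA", "BBB", "CC", "AAAA"]
puts encoded      # 3A3B2C4A
```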
Results:
"AAABBBCCCDDD" => 3A3B3C3D
"AAABBBCCCAAA" => 3A3B3C3A
"AAABBBCCAAA" => 3A3B2C3A
"AAABBBCCAAAA" => 3A3B2C4A
"WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB" => 12W1B12W3B24W1B

Ruby - Convert a string concatenated by & into a hash

I have a string of the form
str="a=b&c=d&e=f&...."
Question is how do I convert the above str in below form
{ "a" => "b" , "c" => "d" , "e" => "f" .... }
You can use this method URI::decode_www_form.
require 'uri'
URI.decode_www_form "a=b&c=d&e=f"
# => [["a", "b"], ["c", "d"], ["e", "f"]]
URI.decode_www_form("a=b&c=d&e=f").to_h
# => {"a"=>"b", "c"=>"d", "e"=>"f"}
Just out of curiosity:
▶ q = "a=b&c=d&e=f"
▶ require 'json'
#⇒ true
▶ JSON.parse "{\"#{q}\"}".gsub /[=&]/, Hash('=' => '":"', '&' => '","')
#⇒ {
# "a" => "b",
# "c" => "d",
# "e" => "f"
#}
The straight way with splits:
q.split('&').map { |e| e.split('=') }.to_h
The simplest answer, if Rack is available, is:
hash = Rack::Utils.parse_query("a=b&c=d&e=f")
=> {"a"=>"b", "c"=>"d", "e"=>"f"} #output
And if you want to convert it back into a query string (to_query comes from ActiveSupport):
hash.to_query
str.split("&").inject({}) do |sum, e|
  k, v = e.split("=")
  sum.merge(k => v)
end
=> {"a"=>"b", "c"=>"d", "e"=>"f"}
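One practical difference between the answers above: the split-based versions leave percent-encoding untouched, while URI.decode_www_form decodes it. A minimal sketch with a made-up query string:

```ruby
require 'uri'

# Hypothetical input: "ü" is percent-encoded, "+" stands for a space,
# and "%26" is an escaped "&" inside a value.
query = "name=J%C3%BCrgen+M&tag=a%26b"

naive   = query.split('&').map { |pair| pair.split('=') }.to_h
decoded = URI.decode_www_form(query).to_h

p naive    # {"name"=>"J%C3%BCrgen+M", "tag"=>"a%26b"}
p decoded  # {"name"=>"Jürgen M", "tag"=>"a&b"}
```

If the input is known to be plain ASCII with no escapes, the split versions give the same result.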

How to split a string containing both delimiter and the escaped delimiter?

My string delimiter is ;. Delimiter is escaped in the string as \;. E.g.,
irb(main):018:0> s = "a;b;;d\\;e"
=> "a;b;;d\\;e"
irb(main):019:0> s.split(';')
=> ["a", "b", "", "d\\", "e"]
Could someone suggest a regex so that the output of split would be ["a", "b", "", "d\\;e"]? I'm using Ruby 1.8.7.
1.8.7 doesn't have negative lookbehind without Oniguruma (which may be compiled in).
1.9.3; yay:
> s = "a;b;c\\;d"
=> "a;b;c\\;d"
> s.split /(?<!\\);/
=> ["a", "b", "c\\;d"]
1.8.7 with Oniguruma doesn't offer a trivial split, but you can get match offsets and pull apart the substrings that way. I assume there's a better way to do this I'm not remembering:
> require 'oniguruma'
> re = Oniguruma::ORegexp.new "(?<!\\\\);"
> s = "hello;there\\;nope;yestho"
> re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds = re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds.collect {|md| md.offset}
=> [[5, 6], [17, 18]]
Other options include:
Splitting on ; and post-processing the results looking for trailing \\, or
Do a char-by-char loop and maintain some simple state and just split manually.
As @dave-newton answered, you could use negative lookbehind, but that isn't supported in 1.8. An alternative that will work in both 1.8 and 1.9 is to use String#scan instead of split, with a pattern that accepts either a character that is not a semicolon or backslash, or any character prefixed by a backslash:
$ irb
>> RUBY_VERSION
=> "1.8.7"
>> s = "a;b;c\\;d"
=> "a;b;c\\;d"
>> s.scan /(?:[^;\\]|\\.)+/
=> ["a", "b", "c\\;d"]
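Note that the scan pattern cannot produce empty fields, so the "" in the question's expected output would be lost. Here is a rough sketch of the char-by-char option mentioned earlier, which keeps them. The method name split_unescaped and its parameters are made up for illustration, and I have only reasoned about it on a modern Ruby.

```ruby
# Splits on `delim` unless it is preceded by `escape`.
# Escape characters are kept in the output, as in the question's expected result.
def split_unescaped(str, delim = ";", escape = "\\")
  fields  = [String.new]
  escaped = false

  str.each_char do |ch|
    if escaped
      fields.last << ch     # char after an escape is always literal
      escaped = false
    elsif ch == escape
      fields.last << ch
      escaped = true
    elsif ch == delim
      fields << String.new  # unescaped delimiter starts a new field
    else
      fields.last << ch
    end
  end

  fields
end

p split_unescaped("a;b;;d\\;e")  # ["a", "b", "", "d\\;e"]
```

Unlike String#split, this keeps trailing empty fields as well.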

Ruby: How to turn a hash into HTTP parameters?

That is pretty easy with a plain hash like
{:a => "a", :b => "b"}
which would translate into
"a=a&b=b"
But what do you do with something more complex like
{:a => "a", :b => ["c", "d", "e"]}
which should translate into
"a=a&b[0]=c&b[1]=d&b[2]=e"
Or even worse, (what to do) with something like:
{:a => "a", :b => [{:c => "c", :d => "d"}, {:e => "e", :f => "f"}]}
Thanks for the much appreciated help with that!
For basic, non-nested hashes, Rails/ActiveSupport has Object#to_query.
>> {:a => "a", :b => ["c", "d", "e"]}.to_query
=> "a=a&b%5B%5D=c&b%5B%5D=d&b%5B%5D=e"
>> CGI.unescape({:a => "a", :b => ["c", "d", "e"]}.to_query)
=> "a=a&b[]=c&b[]=d&b[]=e"
http://api.rubyonrails.org/classes/Object.html#method-i-to_query
If you are using Ruby 1.9.2 or later, you can use URI.encode_www_form if you don't need arrays.
E.g. (from the Ruby docs in 1.9.3):
URI.encode_www_form([["q", "ruby"], ["lang", "en"]])
#=> "q=ruby&lang=en"
URI.encode_www_form("q" => "ruby", "lang" => "en")
#=> "q=ruby&lang=en"
URI.encode_www_form("q" => ["ruby", "perl"], "lang" => "en")
#=> "q=ruby&q=perl&lang=en"
URI.encode_www_form([["q", "ruby"], ["q", "perl"], ["lang", "en"]])
#=> "q=ruby&q=perl&lang=en"
You'll notice that array values are not set with key names containing [] like we've all become used to in query strings. The spec that encode_www_form uses is in accordance with the HTML5 definition of application/x-www-form-urlencoded data.
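As a rough illustration of that repeated-key convention, the sketch below round-trips a hash through URI.encode_www_form and URI.decode_www_form and then regroups the repeated keys. The grouping step is my own addition and assumes Ruby 2.4+ for transform_values.

```ruby
require 'uri'

query = URI.encode_www_form("q" => ["ruby", "perl"], "lang" => "en")
pairs = URI.decode_www_form(query)

# Repeated keys come back as separate pairs; group them to recover the arrays.
grouped = pairs.group_by(&:first).transform_values { |kv| kv.map(&:last) }

puts query  # q=ruby&q=perl&lang=en
p pairs     # [["q", "ruby"], ["q", "perl"], ["lang", "en"]]
p grouped   # {"q"=>["ruby", "perl"], "lang"=>["en"]}
```

Note that after grouping every value is an array, including single ones such as "lang".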
Update: This functionality was removed from the gem.
Julien, your self-answer is a good one, and I've shamelessly borrowed from it, but it doesn't properly escape reserved characters, and there are a few other edge cases where it breaks down.
require "addressable/uri"
uri = Addressable::URI.new
uri.query_values = {:a => "a", :b => ["c", "d", "e"]}
uri.query
# => "a=a&b[0]=c&b[1]=d&b[2]=e"
uri.query_values = {:a => "a", :b => [{:c => "c", :d => "d"}, {:e => "e", :f => "f"}]}
uri.query
# => "a=a&b[0][c]=c&b[0][d]=d&b[1][e]=e&b[1][f]=f"
uri.query_values = {:a => "a", :b => {:c => "c", :d => "d"}}
uri.query
# => "a=a&b[c]=c&b[d]=d"
uri.query_values = {:a => "a", :b => {:c => "c", :d => true}}
uri.query
# => "a=a&b[c]=c&b[d]"
uri.query_values = {:a => "a", :b => {:c => "c", :d => true}, :e => []}
uri.query
# => "a=a&b[c]=c&b[d]"
The gem is 'addressable'
gem install addressable
No need to load up the bloated ActiveSupport or roll your own, you can use Rack::Utils.build_query and Rack::Utils.build_nested_query. Here's a blog post that gives a good example:
require 'rack'
Rack::Utils.build_query(
authorization_token: "foo",
access_level: "moderator",
previous: "index"
)
# => "authorization_token=foo&access_level=moderator&previous=index"
It even handles arrays:
Rack::Utils.build_query( {:a => "a", :b => ["c", "d", "e"]} )
# => "a=a&b=c&b=d&b=e"
Rack::Utils.parse_query _
# => {"a"=>"a", "b"=>["c", "d", "e"]}
Or the more difficult nested stuff:
Rack::Utils.build_nested_query( {:a => "a", :b => [{:c => "c", :d => "d"}, {:e => "e", :f => "f"}] } )
# => "a=a&b[][c]=c&b[][d]=d&b[][e]=e&b[][f]=f"
Rack::Utils.parse_nested_query _
# => {"a"=>"a", "b"=>[{"c"=>"c", "d"=>"d", "e"=>"e", "f"=>"f"}]}
Here's a short and sweet one liner if you only need to support simple ASCII key/value query strings:
hash = {"foo" => "bar", "fooz" => 123}
# => {"foo"=>"bar", "fooz"=>123}
query_string = hash.to_a.map { |x| "#{x[0]}=#{x[1]}" }.join("&")
# => "foo=bar&fooz=123"
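To show what "simple ASCII" means in practice, here is a sketch of how the one-liner breaks on reserved characters, next to a variant that escapes each part with URI.encode_www_form_component. The sample hash is made up.

```ruby
require 'uri'

hash = {"q" => "a b&c", "n" => 1}

# No escaping: the "&" inside the value now looks like a pair separator.
naive = hash.map { |k, v| "#{k}=#{v}" }.join("&")

# Escape keys and values individually before joining.
escaped = hash.map do |k, v|
  "#{URI.encode_www_form_component(k.to_s)}=#{URI.encode_www_form_component(v.to_s)}"
end.join("&")

puts naive    # q=a b&c&n=1
puts escaped  # q=a+b%26c&n=1
```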
Steal from Merb:
# File merb/core_ext/hash.rb, line 87
def to_params
  params = ''
  stack = []
  each do |k, v|
    if v.is_a?(Hash)
      stack << [k,v]
    else
      params << "#{k}=#{v}&"
    end
  end
  stack.each do |parent, hash|
    hash.each do |k, v|
      if v.is_a?(Hash)
        stack << ["#{parent}[#{k}]", v]
      else
        params << "#{parent}[#{k}]=#{v}&"
      end
    end
  end
  params.chop! # trailing &
  params
end
See http://noobkit.com/show/ruby/gems/development/merb/hash/to_params.html
class Hash
  def to_params
    params = ''
    stack = []
    each do |k, v|
      if v.is_a?(Hash)
        stack << [k,v]
      elsif v.is_a?(Array)
        stack << [k,Hash.from_array(v)]
      else
        params << "#{k}=#{v}&"
      end
    end
    stack.each do |parent, hash|
      hash.each do |k, v|
        if v.is_a?(Hash)
          stack << ["#{parent}[#{k}]", v]
        else
          params << "#{parent}[#{k}]=#{v}&"
        end
      end
    end
    params.chop!
    params
  end
  def self.from_array(array = [])
    h = Hash.new
    array.size.times do |t|
      h[t] = array[t]
    end
    h
  end
end
I know this is an old question, but I just wanted to post this bit of code as I could not find a simple gem to do just this task for me.
require 'cgi'
module QueryParams
  def self.encode(value, key = nil)
    case value
    when Hash then value.map { |k, v| encode(v, append_key(key, k)) }.join('&')
    when Array then value.map { |v| encode(v, "#{key}[]") }.join('&')
    when nil then ''
    else
      "#{key}=#{CGI.escape(value.to_s)}"
    end
  end
  def self.append_key(root_key, key)
    root_key.nil? ? key : "#{root_key}[#{key}]"
  end
  # A bare `private` has no effect on `def self.` methods, so hide the helper explicitly.
  private_class_method :append_key
end
Rolled up as gem here: https://github.com/simen/queryparams
{:a=>"a", :b=>"b", :c=>"c"}.map{ |x,v| "#{x}=#{v}" }.reduce{|x,v| "#{x}&#{v}" }
"a=a&b=b&c=c"
Here's another way. For simple queries.
The best approach is to use Hash#to_param (from ActiveSupport), which works fine with arrays.
{a: 1, b: [1,2,3]}.to_param
"a=1&b[]=1&b[]=2&b[]=3"
require 'uri'
class Hash
  def to_query_hash(key)
    reduce({}) do |h, (k, v)|
      new_key = key.nil? ? k : "#{key}[#{k}]"
      v = Hash[v.each_with_index.to_a.map(&:reverse)] if v.is_a?(Array)
      if v.is_a?(Hash)
        h.merge!(v.to_query_hash(new_key))
      else
        h[new_key] = v
      end
      h
    end
  end
  def to_query(key = nil)
    URI.encode_www_form(to_query_hash(key))
  end
end
2.4.2 :019 > {:a => "a", :b => "b"}.to_query_hash(nil)
=> {:a=>"a", :b=>"b"}
2.4.2 :020 > {:a => "a", :b => "b"}.to_query
=> "a=a&b=b"
2.4.2 :021 > {:a => "a", :b => ["c", "d", "e"]}.to_query_hash(nil)
=> {:a=>"a", "b[0]"=>"c", "b[1]"=>"d", "b[2]"=>"e"}
2.4.2 :022 > {:a => "a", :b => ["c", "d", "e"]}.to_query
=> "a=a&b%5B0%5D=c&b%5B1%5D=d&b%5B2%5D=e"
If you are in the context of a Faraday request, you can also just pass the params hash as the second argument, and Faraday takes care of building the proper query part of the URL from it:
faraday_instance.get(url, params_hsh)
I like using this gem:
https://rubygems.org/gems/php_http_build_query
Sample usage:
puts PHP.http_build_query({"a"=>"b","c"=>"d","e"=>[{"hello"=>"world","bah"=>"black"},{"hello"=>"world","bah"=>"black"}]})
# a=b&c=d&e%5B0%5D%5Bbah%5D=black&e%5B0%5D%5Bhello%5D=world&e%5B1%5D%5Bbah%5D=black&e%5B1%5D%5Bhello%5D=world
2.6.3 :001 > hash = {:a => "a", :b => ["c", "d", "e"]}
=> {:a=>"a", :b=>["c", "d", "e"]}
2.6.3 :002 > hash.to_a.map { |x| "#{x[0]}=#{x[1].class == Array ? x[1].join(",") : x[1]}"
}.join("&")
=> "a=a&b=c,d,e"

Resources