Split an array of hashes in an array into slices - ruby

Let's say you have a string:
initial_message = "My dear cousin bill!"
I put this string of N characters into an array of hashes, where each letter is the key and its value is the letter's position in the alphabet (a = 0, b = 1, c = 2, etc.).
hsh_letter_values = Hash[('a'..'z').zip (0..25).to_a] #Map letters to numbers in a hash
clean_message = initial_message.tr('^a-zA-Z0-9','').downcase #remove non-letters
char_map = clean_message.each_char.map { |i| { i => hsh_letter_values[i] } } #map each letter of message to corresponding number
Then I split the char_map into slices of 16.
char_split_map = char_map.each_slice(16).to_a
I want to split each 16 character slice into slices of 4, while keeping the hashes in the same order.
The outcome should look like:
[[[{"m"=>12}, {"y"=>24}, {"d"=>3}, {"e"=>4}], [{"a"=>0}, {"r"=>17}, {"c"=>2}, {"o"=>14}], [{"u"=>20}, {"s"=>18}, {"i"=>8}, {"n"=>13}], [{"b"=>1}, {"i"=>8}, {"l"=>11}, {"l"=>11}]]]
I am planning on adding the values of each letter from each column to get four sums (C1,C2,C3,C4)
So for the first column it would be 12+0+20+1.
This is what I have so far http://repl.it/2cd/1.
Any help with what I'm doing wrong, or a better way to handle this situation?

One way, starting with the message:
msg = "My dear cousin bill!"
arr = msg.downcase.gsub(/[^a-z]/,'').chars.each_slice(4).to_a
#=> [["m", "y", "d", "e"],
# ["a", "r", "c", "o"],
# ["u", "s", "i", "n"],
# ["b", "i", "l", "l"]]
4.times.map { |i| arr.reduce(0) { |t,a| t + (a[i]||?a).ord-?a.ord } }
#=> [33, 67, 24, 42]
msg = "My dearest cousin bill!"
arr = msg.downcase.gsub(/[^a-z]/,'').chars.each_slice(4).to_a
#=> [["m", "y", "d", "e"],
# ["a", "r", "e", "s"],
# ["t", "c", "o", "u"],
# ["s", "i", "n", "b"],
# ["i", "l", "l"]]
4.times.map { |i| arr.reduce(0) { |t,a| t + (a[i]||?a).ord-?a.ord } }
#=>[57, 62, 45, 43]

I would probably go with a slightly different approach:
initial_message = "My dear cousin bill!"
chars = initial_message.tr('^a-zA-Z0-9','').downcase.chars
char_map = ->(char) { char.ord - 97 }
results = chars.each_slice(4).each_slice(4).map do |array|
array.transpose.map {|column| column.reduce(0) {|res, letter| res + char_map[letter]} }
end
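results.inspect #=> "[[33, 67, 24, 42]]"
This skips the intermediate step you described in your question, but it is probably a better way to achieve your final result. Note that Array#transpose raises an IndexError when the rows of a block differ in length, which can happen if the number of letters is not a multiple of 4.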
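For the intermediate structure described in the question (slices of 16 split again into slices of 4, with the hashes kept in order), a minimal sketch might look like the following. The names LETTER_VALUES, nested_char_map and column_sums are hypothetical, and counting a missing letter in a short final row as 0 is an assumption, not something the question specifies.

```ruby
# Map each letter to its alphabet position: "a" => 0 ... "z" => 25.
LETTER_VALUES = ('a'..'z').each_with_index.to_h

# Build [{letter => value}, ...], cut it into blocks of block_size,
# then cut every block into rows of row_size (order is preserved).
def nested_char_map(message, block_size = 16, row_size = 4)
  message.downcase.delete('^a-z').chars
         .map { |c| { c => LETTER_VALUES[c] } }
         .each_slice(block_size)
         .map { |block| block.each_slice(row_size).to_a }
end

# Sum the values column by column within one block.
# A short last row contributes 0 to the columns it lacks (assumption).
def column_sums(block, row_size = 4)
  row_size.times.map do |i|
    block.sum { |row| row[i] ? row[i].values.first : 0 }
  end
end

nested = nested_char_map("My dear cousin bill!")
nested.first.size                  #=> 4 (rows in the first block)
nested.map { |b| column_sums(b) }  #=> [[33, 67, 24, 42]]
```

Because the sums are computed by index rather than with transpose, a message whose length is not a multiple of 4 does not raise an error.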

Related

How to find two elements of the same array that contain all vowels

I want to iterate a given array, for example:
["goat", "action", "tear", "impromptu", "tired", "europe"]
I want to look at all possible pairs.
The desired output is a new array, which contains all pairs, that combined contain all vowels. Also those pairs should be concatenated as one element of the output array:
["action europe", "tear impromptu"]
I tried the following code, but got an error message:
No implicit conversion of nil into string.
def all_vowel_pairs(words)
pairs = []
(0..words.length).each do |i| # iterate through words
(0..words.length).each do |j| # for every word, iterate through words again
pot_pair = words[i].to_s + words[j] # build string from pair
if check_for_vowels(pot_pair) # throw string to helper-method.
pairs << words[i] + " " + words[j] # if gets back true, concatenade and push to output array "pairs"
end
end
end
pairs
end
# helper-method to check for if a string has all vowels in it
def check_for_vowels(string)
vowels = "aeiou"
founds = []
string.each_char do |char|
if vowels.include?(char) && !founds.include?(char)
founds << char
end
end
if founds.length == 5
return true
end
false
end
The following code is intended to provide an efficient way to construct the desired array when the number of words is large. Note that, unlike the other answers, it does not make use of the method Array#combination.
The first part of the section Explanation (below) provides an overview of the approach taken by the algorithm. The details are then filled in.
Code
require 'set'
VOWELS = ["a", "e", "i", "o", "u"]
VOWELS_SET = VOWELS.to_set
def all_vowel_pairs(words)
h = words.each_with_object({}) {|w,h| (h[(w.chars & VOWELS).to_set] ||= []) << w}
h.each_with_object([]) do |(k,v),a|
vowels_needed = VOWELS_SET-k
h.each do |kk,vv|
next unless kk.superset?(vowels_needed)
v.each {|w1| vv.each {|w2| a << "%s %s" % [w1, w2] if w1 < w2}}
end
end
end
Example
words = ["goat", "action", "tear", "impromptu", "tired", "europe", "hear"]
all_vowel_pairs(words)
#=> ["action europe", "hear impromptu", "impromptu tear"]
Explanation
For the given example the steps are as follows.
VOWELS_SET = VOWELS.to_set
#=> #<Set: {"a", "e", "i", "o", "u"}>
h = words.each_with_object({}) {|w,h| (h[(w.chars & VOWELS).to_set] ||= []) << w}
#=> {#<Set: {"o", "a"}>=>["goat"],
# #<Set: {"a", "i", "o"}>=>["action"],
# #<Set: {"e", "a"}>=>["tear", "hear"],
# #<Set: {"i", "o", "u"}>=>["impromptu"],
# #<Set: {"i", "e"}>=>["tired"],
# #<Set: {"e", "u", "o"}>=>["europe"]}
It is seen that the keys of h are subsets of the five vowels. Each value is an array of the elements of words that contain exactly the vowels given by the key and no others. The values therefore collectively form a partition of words. When the number of words is large one would expect h to have 31 keys (2**5 - 1), one for each non-empty subset of the vowels, plus an empty-set key if any word contains none of the five vowels.
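As an aside, the same grouping can be sketched with Enumerable#group_by, which some readers may find easier to follow than the each_with_object form. The names FIVE_VOWELS and words_by_vowel_set are hypothetical and are used only to keep the snippet self-contained.

```ruby
require 'set'

FIVE_VOWELS = %w[a e i o u].freeze

# Group words by the set of vowels each one contains.
def words_by_vowel_set(words)
  words.group_by { |w| (w.chars & FIVE_VOWELS).to_set }
end

groups = words_by_vowel_set(%w[goat action tear impromptu tired europe hear])
groups.size             #=> 6
groups[Set['a', 'e']]   #=> ["tear", "hear"]
```

Sets compare equal regardless of element order, so Set['a', 'e'] finds the group that was stored under the key built from "e" and "a".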
We now loop through the key-value pairs of h. For each, with key k and value v, the set of missing vowels (vowels_needed) is determined, then we loop through those key-value pairs [kk, vv] of h for which kk is a superset of vowels_needed. All combinations of elements of v and vv are then added to the array being returned (after an adjustment to avoid double-counting each pair of words).
Continuing,
enum = h.each_with_object([])
#=> #<Enumerator: {#<Set: {"o", "a"}>=>["goat"],
# #<Set: {"a", "i", "o"}>=>["action"],
# ...
# #<Set: {"e", "u", "o"}>=>["europe"]}:
# each_with_object([])>
The first value is generated by enum and passed to the block, and the block variables are assigned values:
(k,v), a = enum.next
#=> [[#<Set: {"o", "a"}>, ["goat"]], []]
See Enumerator#next.
The individual variables are assigned values by array decomposition:
k #=> #<Set: {"o", "a"}>
v #=> ["goat"]
a #=> []
The block calculations are now performed.
vowels_needed = VOWELS_SET-k
#=> #<Set: {"e", "i", "u"}>
h.each do |kk,vv|
next unless kk.superset?(vowels_needed)
v.each {|w1| vv.each {|w2| a << "%s %s" % [w1, w2] if w1 < w2}}
end
The word "goat" (v) has vowels "o" and "a", so it can only be matched with words that contain vowels "e", "i" and "u" (and possibly "o" and/or "a"). The expression
next unless kk.superset?(vowels_needed)
skips those keys of h (kk) that are not supersets of vowels_needed. See Set#superset?.
None of the words in words contain "e", "i" and "u" so the array a is unchanged.
The next element is now generated by enum, passed to the block and the block variables are assigned values:
(k,v), a = enum.next
#=> [[#<Set: {"a", "i", "o"}>, ["action"]], []]
k #=> #<Set: {"a", "i", "o"}>
v #=> ["action"]
a #=> []
The block calculation begins:
vowels_needed = VOWELS_SET-k
#=> #<Set: {"e", "u"}>
We see that h has only one key-value pair for which the key is a superset of vowels_needed:
kk = %w|e u o|.to_set
#=> #<Set: {"e", "u", "o"}>
vv = ["europe"]
We therefore execute:
v.each {|w1| vv.each {|w2| a << "%s %s" % [w1, w2] if w1 < w2}}
which adds one element to a:
a #=> ["action europe"]
The clause if w1 < w2 is to ensure that later in the calculations "europe action" is not added to a.
If v (words containing 'a', 'i' and 'o') and vv (words containing 'e', 'u' and 'o') had instead been:
v #=> ["action", "notification"]
vv #=> ["europe", "route"]
we would have added "action europe", "action route" and "notification route" to a. ("europe notification" would be added later, when k #=> #<Set: {"e", "u", "o"}>.)
Benchmark
I benchmarked my method against the others suggested, using @theTinMan's Fruity benchmark code. The only differences were in the array of words to be tested and the addition of my method to the benchmark, which I named cary. For the array of words to be considered I selected 600 words at random from a file of English words on my computer:
words = IO.readlines('/usr/share/dict/words', chomp: true).sample(600)
words.first 10
#=> ["posadaship", "explosively", "expensilation", "conservatively", "plaiting",
# "unpillared", "intertwinement", "nonsolidified", "uraemic", "underspend"]
This array was found to contain 46,436 pairs of words containing all five vowels.
The results were as shown below.
compare {
_viktor { viktor(words) }
_ttm1 { ttm1(words) }
_ttm2 { ttm2(words) }
_ttm3 { ttm3(words) }
_cary { cary(words) }
}
Running each test once. Test will take about 44 seconds.
_cary is faster than _ttm3 by 5x ± 0.1
_ttm3 is faster than _viktor by 50.0% ± 1.0%
_viktor is faster than _ttm2 by 30.000000000000004% ± 1.0%
_ttm2 is faster than _ttm1 by 2.4x ± 0.1
I then compared cary with ttm3 for 1,000 randomly selected words. This array was found to contain 125,068 pairs of words containing all five vowels. That result was as follows:
Running each test once. Test will take about 19 seconds.
_cary is faster than _ttm3 by 3x ± 1.0
To get a feel for the variability of the benchmark I ran this last comparison twice more, each with a new random selection of 1,000 words. That gave me the following results:
Running each test once. Test will take about 17 seconds.
_cary is faster than _ttm3 by 5x ± 1.0
Running each test once. Test will take about 18 seconds.
_cary is faster than _ttm3 by 4x ± 1.0
It is seen that there is considerable variation among the samples.
You said pairs, so I assume you mean combinations of two elements. I built every two-element combination of the array with the #combination method, then used #select to keep only the pairs that contain all the vowels once joined. Finally, I joined each pair with a space:
["goat", "action", "tear", "impromptu", "tired", "europe"]
.combination(2)
.select { |c| c.join('') =~ /\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b/ }
.map{ |w| w.join(' ') }
#=> ["action europe", "tear impromptu"]
The regex is from "What is the regex to match the words containing all the vowels?".
Starting similarly to Viktor's, I'd use a simple test to see what vowels exist in the words and compare to whether they match "aeiou" after stripping duplicates and sorting them:
def ttm1(ary)
ary.combination(2).select { |a|
a.join.scan(/[aeiou]/).uniq.sort.join == 'aeiou'
}.map { |a| a.join(' ') }
end
ttm1(words) # => ["action europe", "tear impromptu"]
Breaking it down so you can see what's happening.
["goat", "action", "tear", "impromptu", "tired", "europe"] # => ["goat", "action", "tear", "impromptu", "tired", "europe"]
.combination(2)
.select { |a| a # => ["goat", "action"], ["goat", "tear"], ["goat", "impromptu"], ["goat", "tired"], ["goat", "europe"], ["action", "tear"], ["action", "impromptu"], ["action", "tired"], ["action", "europe"], ["tear", "impromptu"], ["tear", "tired"], ["tear", "europe"], ["impromptu", "tired"], ["impromptu", "europe"], ["tired", "europe"]
.join # => "goataction", "goattear", "goatimpromptu", "goattired", "goateurope", "actiontear", "actionimpromptu", "actiontired", "actioneurope", "tearimpromptu", "teartired", "teareurope", "impromptutired", "impromptueurope", "tiredeurope"
.scan(/[aeiou]/) # => ["o", "a", "a", "i", "o"], ["o", "a", "e", "a"], ["o", "a", "i", "o", "u"], ["o", "a", "i", "e"], ["o", "a", "e", "u", "o", "e"], ["a", "i", "o", "e", "a"], ["a", "i", "o", "i", "o", "u"], ["a", "i", "o", "i", "e"], ["a", "i", "o", "e", "u", "o", "e"], ["e", "a", "i", "o", "u"], ["e", "a", "i", "e"], ["e", "a", "e", "u", "o", "e"], ["i", "o", "u", "i", "e"], ["i", "o", "u", "e", "u", "o", "e"], ["i", "e", "e", "u", "o", "e"]
.uniq # => ["o", "a", "i"], ["o", "a", "e"], ["o", "a", "i", "u"], ["o", "a", "i", "e"], ["o", "a", "e", "u"], ["a", "i", "o", "e"], ["a", "i", "o", "u"], ["a", "i", "o", "e"], ["a", "i", "o", "e", "u"], ["e", "a", "i", "o", "u"], ["e", "a", "i"], ["e", "a", "u", "o"], ["i", "o", "u", "e"], ["i", "o", "u", "e"], ["i", "e", "u", "o"]
.sort # => ["a", "i", "o"], ["a", "e", "o"], ["a", "i", "o", "u"], ["a", "e", "i", "o"], ["a", "e", "o", "u"], ["a", "e", "i", "o"], ["a", "i", "o", "u"], ["a", "e", "i", "o"], ["a", "e", "i", "o", "u"], ["a", "e", "i", "o", "u"], ["a", "e", "i"], ["a", "e", "o", "u"], ["e", "i", "o", "u"], ["e", "i", "o", "u"], ["e", "i", "o", "u"]
.join == 'aeiou' # => false, false, false, false, false, false, false, false, true, true, false, false, false, false, false
} # => [["action", "europe"], ["tear", "impromptu"]]
Looking at the code, it was jumping through hoops to find whether all the vowels exist. Every check had to step through several methods before determining whether all the vowels were found. In other words, it couldn't short-circuit and fail early, which isn't good.
This code will short-circuit:
def ttm2(ary)
ary.combination(2).select { |a|
str = a.join
str[/a/] && str[/e/] && str[/i/] && str[/o/] && str[/u/]
}.map { |a| a.join(' ') }
end
ttm2(words) # => ["action europe", "tear impromptu"]
But I don't like using the regular expression engine this way, as it's slower than doing a direct lookup, which led to:
def ttm3(ary)
ary.combination(2).select { |a|
str = a.join
str['a'] && str['e'] && str['i'] && str['o'] && str['u']
}.map { |a| a.join(' ') }
end
Here's the benchmark:
require 'fruity'
words = ["goat", "action", "tear", "impromptu", "tired", "europe"]
def viktor(ary)
ary.combination(2)
.select { |c| c.join('') =~ /\b(?=\w*?a)(?=\w*?e)(?=\w*?i)(?=\w*?o)(?=\w*?u)[a-zA-Z]+\b/ }
.map{ |w| w.join(' ') }
end
viktor(words) # => ["action europe", "tear impromptu"]
def ttm1(ary)
ary.combination(2).select { |a|
a.join.scan(/[aeiou]/).uniq.sort.join == 'aeiou'
}.map { |a| a.join(' ') }
end
ttm1(words) # => ["action europe", "tear impromptu"]
def ttm2(ary)
ary.combination(2).select { |a|
str = a.join
str[/a/] && str[/e/] && str[/i/] && str[/o/] && str[/u/]
}.map { |a| a.join(' ') }
end
ttm2(words) # => ["action europe", "tear impromptu"]
def ttm3(ary)
ary.combination(2).select { |a|
str = a.join
str['a'] && str['e'] && str['i'] && str['o'] && str['u']
}.map { |a| a.join(' ') }
end
ttm3(words) # => ["action europe", "tear impromptu"]
compare {
_viktor { viktor(words) }
_ttm1 { ttm1(words) }
_ttm2 { ttm2(words) }
_ttm3 { ttm3(words) }
}
With the results:
# >> Running each test 256 times. Test will take about 1 second.
# >> _ttm3 is similar to _viktor
# >> _viktor is similar to _ttm2
# >> _ttm2 is faster than _ttm1 by 2x ± 0.1
Now, because this looks so much like a homework assignment, it's important to understand that schools are aware of Stack Overflow, and they look for students asking for help, so you probably don't want to reuse this code, especially not verbatim.
Your code contains two errors, one of which is causing the error message.
(0..words.length) loops from 0 to 6. However, words[6] does not exist (arrays are zero-based), so you get nil. Replacing it with (0..words.length-1) (twice) should take care of that.
You will get every correct result twice, once as "action europe" and once as "europe action". This is caused by looping too much, going over every combination twice. Replace the second loop, (0..words.length-1), with (i+1..words.length-1), which also keeps a word from being paired with itself.
This cumbersome bookkeeping of indexes is boring and leads to mistakes very often. This is why Ruby programmers often prefer more hassle-free methods (like combination as in other answers), avoiding indexes altogether.
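Putting both fixes together, a corrected index-based version might look like the sketch below. It keeps the method names from the question. The helper is shortened with all?, and the inner loop starts at i + 1 so that a word is never paired with itself; both are choices made here for illustration and are not part of the original code.

```ruby
# True when the string contains every one of the five vowels.
def check_for_vowels(string)
  %w[a e i o u].all? { |v| string.include?(v) }
end

def all_vowel_pairs(words)
  pairs = []
  (0...words.length).each do |i|             # exclusive range: last index is length - 1
    ((i + 1)...words.length).each do |j|     # only pairs with j > i, so no duplicates
      pairs << "#{words[i]} #{words[j]}" if check_for_vowels(words[i] + words[j])
    end
  end
  pairs
end

all_vowel_pairs(["goat", "action", "tear", "impromptu", "tired", "europe"])
#=> ["action europe", "tear impromptu"]
```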

Why can't I sort an array of strings by `count`?

With this code:
line = ("Ignore punctuation, please :)")
string = line.strip.downcase.split(//)
string.select! {|x| /[a-z]/.match(x) }
string.sort_by!{ |x| string.count(x)}
the result is:
["r", "g", "s", "l", "c", "o", "o", "p", "u", "i", "t", "u", "a", "t", "i", "a", "p", "n", "e", "e", "n", "n", "e"]
Does sorting by count not work in this case? Why? Is there a better way to isolate the words by frequency?
By your comment, I suppose that you want to sort characters by frequency and then alphabetically. When string.count(x) is the only sort_by! criterion, letters with the same frequency can appear interleaved with one another. To sort the letters within each frequency group alphabetically, add a second criterion to the sort_by! block:
line = ("Ignore punctuation, please :)")
string = line.strip.downcase.split(//)
string.select! {|x| /[a-z]/.match(x) }
string.sort_by!{ |x| [string.count(x), x]}
Then the output will be
["c", "g", "l", "r", "s", "a", "a", "i", "i", "o", "o", "p", "p", "t", "t", "u", "u", "e", "e", "e", "n", "n", "n"]
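The same two-level ordering can be sketched without calling count once per element, by computing the frequencies up front. The method name letters_by_frequency is hypothetical, and Enumerable#tally assumes Ruby 2.7 or later.

```ruby
# Sort letters by frequency, breaking ties alphabetically.
def letters_by_frequency(line)
  letters = line.downcase.scan(/[a-z]/)   # keep only lowercase letters
  counts  = letters.tally                 # {"i"=>2, "g"=>1, ...}; Ruby 2.7+
  letters.sort_by { |c| [counts[c], c] }  # array keys compare element by element
end

letters_by_frequency("Ignore punctuation, please :)")
#=> ["c", "g", "l", "r", "s", "a", "a", "i", "i", "o", "o", "p", "p",
#    "t", "t", "u", "u", "e", "e", "e", "n", "n", "n"]
```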
Let's look at your code line-by-line.
line = ("Ignore punctuation, please :)")
s = line.strip.downcase
#=> "ignore punctuation, please :)"
There's no particular reason to strip here, as you will be removing spaces and punctuation later anyway.
string = s.split(//)
#=> ["i", "g", "n", "o", "r", "e", " ", "p", "u", "n", "c", "t",
# "u", "a", "t", "i", "o", "n", ",", " ", "p", "l", "e", "a",
# "s", "e", " ", ":", ")"]
You've chosen to split the sentence into characters, which is fine, but as I'll mention at the end, you could just use String methods. In any case,
string = s.chars
does the same thing and is arguably more clear. What you have now is an array named string. Isn't that a bit confusing? Let's instead call it arr:
arr = s.chars
(One often sees s and str for names of strings, a and arr for names of arrays, h and hash for names of hashes, and so on.)
arr.select! {|x| /[a-z]/.match(x) }
#=> ["i", "g", "n", "o", "r", "e", "p", "u", "n", "c", "t", "u",
# "a", "t", "i", "o", "n", "p", "l", "e", "a", "s", "e"]
Now you've eliminated all but lowercase letters. You could also write that:
arr.select! {|x| x =~ /[a-z]/ }
or
arr.select! {|x| x[/[a-z]/] }
You are now ready to sort.
arr.sort_by!{ |x| arr.count(x) }
#=> ["l", "g", "s", "c", "r", "i", "p", "u", "a", "o", "t", "p",
# "a", "t", "i", "o", "u", "n", "n", "e", "e", "n", "e"]
This is OK, but it's not good practice to be sorting an array in place and counting the frequency of its elements at the same time. Better would be:
arr1 = arr.sort_by{ |x| arr.count(x) }
which gives the same ordering. Is the resulting sorted array correct? Let's count the number of times each letter appears in the string.
I will create a hash whose keys are the unique elements of arr and whose values are the number of times the associated key appears in arr. There are a few ways to do this. A simple but not very efficient way is as follows:
h = {}
a = arr.uniq
#=> ["l", "g", "s", "c", "r", "i", "p", "u", "a", "o", "t", "n", "e"]
a.each { |c| h[c] = arr.count(c) }
h #=> {"l"=>1, "g"=>1, "s"=>1, "c"=>1, "r"=>1, "i"=>2, "p"=>2,
# "u"=>2, "a"=>2, "o"=>2, "t"=>2, "n"=>3, "e"=>3}
This would normally be written:
h = arr.uniq.each_with_object({}) { |c,h| h[c] = arr.count(c) }
The elements of h are in increasing order of value, but that's just coincidence. To ensure they are in that order (to make it easier to see the order), we would need to construct an array, sort it, then convert it to a hash:
a = arr.uniq.map { |c| [c, arr.count(c)] }
#=> [["l", 1], ["g", 1], ["s", 1], ["c", 1], ["r", 1], ["a", 2], ["p", 2],
# ["u", 2], ["i", 2], ["o", 2], ["t", 2], ["n", 3], ["e", 3]]
a = a.sort_by { |_,count| count }
#=> [["l", 1], ["g", 1], ["s", 1], ["c", 1], ["r", 1], ["a", 2], ["t", 2],
# ["u", 2], ["i", 2], ["o", 2], ["p", 2], ["n", 3], ["e", 3]]
h = Hash[a]
#=> {"l"=>1, "g"=>1, "s"=>1, "c"=>1, "r"=>1, "i"=>2, "t"=>2,
# "u"=>2, "a"=>2, "o"=>2, "p"=>2, "n"=>3, "e"=>3}
One would normally see this written:
h = Hash[arr.uniq.map { |c| [c, arr.count(c)] }.sort_by(&:last)]
or, in Ruby v2.1+ (where Array#to_h is available):
h = arr.uniq.map { |c| [c, arr.count(c)] }.sort_by(&:last).to_h
Note that, prior to Ruby 1.9, there was no concept of key ordering in hashes.
The values of h's key-value pairs show that your sort is correct. It is not, however, very efficient. That's because in:
arr.sort_by { |x| arr.count(x) }
you repeatedly traverse arr, counting frequencies of elements. It's better to construct the hash above:
h = arr.uniq.each_with_object({}) { |c,h| h[c] = arr.count(c) }
before performing the sort, then:
arr.sort_by { |x| h[x] }
As an aside, let me mention a more efficient way to construct the hash h, one which requires only a single pass through arr:
h = Hash.new(0)
arr.each { |x| h[x] += 1 }
h #=> {"l"=>1, "g"=>1, "s"=>1, "c"=>1, "r"=>1, "a"=>2, "p"=>2,
# "u"=>2, "i"=>2, "o"=>2, "t"=>2, "n"=>3, "e"=>3}
or, more succinctly:
h = arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }
Here h is called a counting hash:
h = Hash.new(0)
creates an empty hash whose default value is zero. This means that if h does not have a key k, h[k] will return zero. The abbreviated assignment h[c] += 1 expands to:
h[c] = h[c] + 1
and if h does not have a key c, h[c] on the right side returns the default value:
h[c] = 0 + 1 #=> 1
but the next time c is encountered:
h[c] = h[c] + 1
#=> 1 + 1 => 2
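The default-value behaviour described above can be checked with a small sketch. The method name char_counts is hypothetical.

```ruby
# Count characters with a hash whose default value is 0.
def char_counts(str)
  counts = Hash.new(0)
  str.each_char { |c| counts[c] += 1 }  # counts[c] = counts[c] + 1
  counts
end

counts = char_counts('banana')
counts['a']        #=> 3
counts['z']        #=> 0 (the default is returned for a missing key)
counts.key?('z')   #=> false (reading a missing key does not store it)
```

Only the assignment half of += creates the key, which is why looking up 'z' leaves the hash unchanged.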
Lastly, let's start over and do as much as we can with String methods:
line = ("Ignore punctuation, please :)")
s = line.strip.downcase.gsub(/./) { |c| (c =~ /[a-z]/) ? c : '' }
#=> "ignorepunctuationplease"
h = s.each_char.with_object(Hash.new(0)) { |c,h| h[c] += 1 }
#=> {"i"=>2, "g"=>1, "n"=>3, "o"=>2, "r"=>1, "e"=>3, "p"=>2,
# "u"=>2, "c"=>1, "t"=>2, "a"=>2, "l"=>1, "s"=>1}
s.each_char.sort_by { |c| h[c] }
#=> ["l", "g", "s", "c", "r", "i", "p", "u", "a", "o", "t", "p",
# "a", "t", "i", "o", "u", "n", "n", "e", "e", "n", "e"]

How to join every X amount of characters together in an Array - Ruby

If I want to join every X amount of letters together in an array how could I implement this?
In this case I want to join every two letters together
Input: array = ["b", "i", "e", "t", "r", "o"]
Output: array = ["bi", "et", "ro"]
each_slice (docs):
arr = 'bietro'.split ''
# grab each slice of 2 elements
p arr.each_slice(2).to_a
#=> [["b", "i"], ["e", "t"], ["r", "o"]]
# map `join' over each of the slices
p arr.each_slice(2).map(&:join)
#=> ["bi", "et", "ro"]
@Doorknow shows the best way, but here are two (among many, many) other ways:
def bunch_em(arr,n)
((arr.size+n-1)/n).times.map { |i| arr.slice(i*n,n).join }
end
arr = ["b", "i", "e", "t", "r", "o"]
bunch_em(arr,2) #=> ["bi", "et", "ro"]
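Another possible approach, not taken from the answers above, is to go through a string and let a regular expression do the grouping. This sketch assumes every element is a single character; join_every is a hypothetical name.

```ruby
# Join single-character elements into strings of at most n characters.
def join_every(arr, n)
  arr.join.scan(/.{1,#{n}}/m)   # greedy: n characters per match, fewer at the end
end

join_every(["b", "i", "e", "t", "r", "o"], 2) #=> ["bi", "et", "ro"]
join_every(["b", "i", "e", "t", "r"], 2)      #=> ["bi", "et", "r"]
```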

How to match bar, b-a-r, b--a--r etc in a string by Regexp

Given a string, I want to find the word bar, b-a-r, b--a--r, etc., where - can be any letter, but the interval between letters must be the same.
All letters are lowercase and there are no gaps between them.
For example, bar, beayr, qbowarprr and wbxxxxxayyyyyrzzz should all match.
I tried /b[a-z]*a[a-z]*r/ but this matches bxar, which is wrong.
I am wondering whether I can achieve this with a regexp.
Here is one way to get all matches.
Code
def all_matches_with_spacers(word, str)
word_size = word.size
word_arr = word.chars
str_arr = str.chars
(0..(str.size - word_size)/(word_size-1)).each_with_object([]) do |n, arr|
regex = Regexp.new(word_arr.join(".{#{n}}"))
str_arr.each_cons(word_size + n * (word_size - 1))
.map(&:join)
.each { |substring| arr << substring if substring =~ regex }
end
end
This requires word.size > 1.
Example
all_matches_with_spacers('bar', 'bar') #=> ["bar"]
all_matches_with_spacers('bar', 'beayr') #=> ["beayr"]
all_matches_with_spacers('bar', 'qbowarprr') #=> ["bowarpr"]
all_matches_with_spacers('bar', 'wbxxxxxayyyyyrzzz') #=> ["bxxxxxayyyyyr"]
all_matches_with_spacers('bobo', 'bobobocbcbocbcobcodbddoddbddobddoddbddob')
#=> ["bobo", "bobo", "bddoddbddo", "bddoddbddo"]
Explanation
Suppose
word = 'bobo'
str = 'bobobocbcbocbcobcodbddoddbddobddoddbddob'
then
word_size = word.size #=> 4
word_arr = word.chars #=> ["b", "o", "b", "o"]
str_arr = str.chars
#=> ["b", "o", "b", "o", "b", "o", "c", "b", "c", "b", "o", "c", "b", "c",
# "o", "b", "c", "o", "d", "b", "d", "d", "o", "d", "d", "b", "d", "d",
# "o", "b", "d", "d", "o", "d", "d", "b", "d", "d", "o", "b"]
If n is the number of spacers between each letter of word, we require
word.size + n * (word.size - 1) <= str.size
Hence (since str.size => 40),
n <= (str.size - word_size)/(word_size-1) #=> (40-4)/(4-1) => 12
We therefore will iterate over zero to 12 spacers:
(0..12).each_with_object([]) do |n, arr| .. end
Enumerable#each_with_object creates an initially empty array, denoted by the block variable arr. The first value passed to the block is zero (spacers), assigned to the block variable n.
We then have
regex = Regexp.new(word_arr.join(".{#{0}}")) #=> /b.{0}o.{0}b.{0}o/
which is the same as /bobo/. With n spacers, word has length
word_size + n * (word_size - 1) #=> 4 (with n = 0)
To extract all sub-arrays of str_arr with this length, we invoke:
str_arr.each_cons(word_size + n * (word_size - 1))
Here, with n = 0, this is:
enum = str_arr.each_cons(4)
#=> #<Enumerator: ["b", "o", "b", "o", "b", "o",...,"b"]:each_cons(4)>
This enumerator will pass the following into its block:
enum.to_a
#=> [["b", "o", "b", "o"], ["o", "b", "o", "b"], ["b", "o", "b", "o"],
# ["o", "b", "o", "c"], ["b", "o", "c", "b"], ["o", "c", "b", "c"],
# ["c", "b", "c", "b"], ["b", "c", "b", "o"], ["c", "b", "o", "c"],
# ["b", "o", "c", "b"], ["o", "c", "b", "c"], ["c", "b", "c", "o"],
# ["b", "c", "o", "b"], ["c", "o", "b", "c"], ["o", "b", "c", "o"],
# ...
# ["d", "d", "o", "b"]]
We next convert these to strings:
ar = enum.map(&:join)
#=> ["bobo", "obob", "bobo", "oboc", "bocb", "ocbc", "cbcb", "bcbo",
# "cboc", "bocb", "ocbc", "cbco", "bcob", "cobc", "obco",
# ...
# "ddob"]
and add each (assigned to the block variable substring) to the array arr for which:
substring =~ regex
ar.each { |substring| arr << substring if substring =~ regex }
arr #=> ["bobo", "bobo"]
Next we increment the number of spacers to n = 1. This has the following effect:
regex = Regexp.new(word_arr.join(".{#{1}}")) #=> /b.{1}o.{1}b.{1}o/
str_arr.each_cons(4 + 1 * (4 - 1)) #=> str_arr.each_cons(7)
so we now examine the strings
ar = str_arr.each_cons(7).map(&:join)
#=> ["boboboc", "obobocb", "bobocbc", "obocbcb", "bocbcbo", "ocbcboc",
# "cbcbocb", "bcbocbc", "cbocbco", "bocbcob", "ocbcobc", "cbcobco",
# "bcobcod", "cobcodb", "obcodbd", "bcodbdd", "codbddo", "odbddod",
# "dbddodd", "bddoddb", "ddoddbd", "doddbdd", "oddbddo", "ddbddob",
# "dbddobd", "bddobdd", "ddobddo", "dobddod", "obddodd", "bddoddb",
# "ddoddbd", "doddbdd", "oddbddo", "ddbddob"]
ar.each { |substring| arr << substring if substring =~ regex }
There are no matches with one spacer, so arr remains unchanged:
arr #=> ["bobo", "bobo"]
For n = 2 spacers:
regex = Regexp.new(word_arr.join(".{#{2}}")) #=> /b.{2}o.{2}b.{2}o/
str_arr.each_cons(4 + 2 * (4 - 1)) #=> str_arr.each_cons(10)
ar = str_arr.each_cons(10).map(&:join)
#=> ["bobobocbcb", "obobocbcbo", "bobocbcboc", "obocbcbocb", "bocbcbocbc",
# "ocbcbocbco", "cbcbocbcob", "bcbocbcobc", "cbocbcobco", "bocbcobcod",
# ...
# "ddoddbddob"]
ar.each { |substring| arr << substring if substring =~ regex }
arr #=> ["bobo", "bobo", "bddoddbddo", "bddoddbddo"]
No matches are found for more than two spacers, so the method returns
["bobo", "bobo", "bddoddbddo", "bddoddbddo"]
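The same search can also be sketched without building a regular expression, by checking the characters at equally spaced positions directly. The method name spaced_matches is hypothetical; the bound on the gap is the one derived above.

```ruby
# Return every substring of str that spells word with the same number
# of filler characters (gap) between consecutive letters.
def spaced_matches(word, str)
  raise ArgumentError, 'word must have at least two characters' if word.size < 2

  max_gap = (str.size - word.size) / (word.size - 1)
  (0..max_gap).flat_map do |gap|
    span = word.size + gap * (word.size - 1)   # total length of a match
    (0..str.size - span)
      .select { |start| word.each_char.with_index.all? { |ch, k| str[start + k * (gap + 1)] == ch } }
      .map { |start| str[start, span] }
  end
end

spaced_matches('bar', 'beayr')     #=> ["beayr"]
spaced_matches('bar', 'qbowarprr') #=> ["bowarpr"]
spaced_matches('bar', 'bxar')      #=> []
```

When str is shorter than word, max_gap is negative, the range is empty, and the method returns an empty array.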
For reference, there is a beautiful solution to the overall problem that is available in regex flavors that allow a capturing group to refer to itself:
^[^b]*bar|b(?:[^a](?=[^a]*a(\1?+.)))+a\1r
Sadly, Ruby doesn't allow this.
The interesting bit is on the right side of the alternation. After matching the initial b, we define a non-capturing group for the characters between b and a. This group is repeated with the +. Between the a and the r, we inject capture group 1 with \1. This group was captured one character at a time, overwriting itself with each pass, as each character between b and a was added.
See Quantifier Capture, where the solution was demonstrated by @CasimiretHippolyte, who refers to the idea behind the technique as the "qtax trick".

How to find all longest words in a string?

If I have a string with no spaces in it, just a concatenation like "hellocarworld", I want to get back an array of the longest dictionary words, so I would get ['hello','car','world']. I would not get back words such as 'a', because that is contained in 'car'.
The dictionary words can come from anywhere such as the dictionary on unix:
words = File.readlines("/usr/share/dict/words").collect{|x| x.strip}
string= "thishasmanywords"
How would you go about doing this?
I would suggest the following.
Code
For a given string and dictionary, dict:
string_arr = string.chars
string_arr.size.downto(1).with_object([]) { |n,arr|
string_arr.each_cons(n) { |a|
word = a.join
arr << word if (dict.include?(word) && !arr.any? {|w| w.include?(word) })}}
Examples
dict = File.readlines("/usr/share/dict/words").collect{|x| x.strip}
string = "hellocarworld"
#=> ["hello", "world", "loca", "car"]
string= "thishasmanywords"
#=> ["this", "hish", "many", "word", "sha", "sma", "as"]
"loca" is the plural of "locus". I'd never heard of "hish", "sha" or "sma". They all appear to be slang words, as I could only find them in something called the "Urban Dictionary".
Explanation
string_arr = "hellocarworld".chars
#=> ["h", "e", "l", "l", "o", "c", "a", "r", "w", "o", "r", "l", "d"]
string_arr.size
#=> 13
so for this string we have:
13.downto(1).with_object([]) { |n,arr|...
where arr is an initially-empty array that will be computed and returned. For n => 13,
enum = string_arr.each_cons(13)
#<Enumerator: ["h","e","l","l","o","c","a","r","w","o","r","l","d"]:each_cons(13)>
which enumerates over an array consisting of the single array string_arr:
enum.size #=> 1
enum.first == string_arr #=> true
That single array is assigned to the block variable a, so we obtain:
word = enum.first.join
#=> "hellocarworld"
We find
dict.include?(word) #=> false
so this word is not added to the array arr. If it were in the dictionary, we would check that it is not a substring of any word already in arr; those words are all at least as long.
Next we compute:
enum = string_arr.each_cons(12)
#<Enumerator: ["h","e","l","l","o","c","a","r","w","o","r","l","d"]:each_cons(12)>
which we can see enumerates two arrays:
enum = string_arr.each_cons(12).to_a
#=> [["h", "e", "l", "l", "o", "c", "a", "r", "w", "o", "r", "l"],
# ["e", "l", "l", "o", "c", "a", "r", "w", "o", "r", "l", "d"]]
corresponding to the words:
enum.first.join #=> "hellocarworl"
enum.last.join #=> "ellocarworld"
neither of which is in the dictionary. We continue in this fashion, until we reach n => 1:
string_arr.each_cons(1).to_a
#=> [["h"], ["e"], ["l"], ["l"], ["o"], ["c"],
# ["a"], ["r"], ["w"], ["o"], ["r"], ["l"], ["d"]]
We find only "a" in the dictionary, but as it is a substring of both "loca" and "car", which are already elements of the array arr, we do not add it.
This can be a bit tricky if you're not familiar with the technique. I often lean heavily on regular expressions for this:
words = File.readlines("/usr/share/dict/words").collect(&:strip).reject(&:empty?)
regexp = Regexp.new(words.sort_by(&:length).reverse.join('|'))
phrase = "hellocarworld"
equiv = [ ]
while (m = phrase.match(regexp)) do
phrase.gsub!(m[0]) do
equiv << m[0]
'*'
end
end
equiv
# => ["hello", "car", "world"]
Update: Strip out blank strings which would cause the while loop to run forever.
Starting at the beginning of the input string, find the longest word in the dictionary. Chop that word off the beginning of the input string and repeat.
Once the input string is empty, you are done. If the string is not empty but no word was found, remove the first character and continue the process.
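The procedure described above can be sketched as follows. The method name greedy_split and the small word list are hypothetical; a real run would load the dictionary file instead. Using a Set for lookups is a design choice made here, not part of the description.

```ruby
require 'set'

# Repeatedly take the longest dictionary word at the front of the
# remaining input; if none is found, skip one character.
def greedy_split(string, dict)
  words   = dict.to_set
  max_len = words.map(&:length).max || 0
  found   = []
  i = 0
  while i < string.length
    len = [max_len, string.length - i].min
    # Shrink the candidate until it is a dictionary word or nothing is left.
    len -= 1 while len > 0 && !words.include?(string[i, len])
    if len > 0
      found << string[i, len]
      i += len
    else
      i += 1   # no word starts here; drop this character
    end
  end
  found
end

greedy_split("hellocarworld", %w[a he hell hello car world low])
#=> ["hello", "car", "world"]
```

Being greedy, this can miss a split that needs a shorter first word to succeed; it follows the description rather than searching all segmentations.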
