Concatenate array elements by groups of 3? - ruby

I have this array:
strings = %w(John likes Pie Diana prefers Cupcakes)
Which will look like:
strings[0] -> "John"
strings[1] -> "likes"
strings[2] -> "Pie"
strings[3] -> "Diana"
strings[4] -> "prefers"
strings[5] -> "Cupcakes"
How can I transform it into this?
strings[0] -> "John likes Pie"
strings[1] -> "Diana prefers Cupcakes"

strings = strings.each_slice(3).map{|a| a.join(" ")}

It appears (note highlighting in question) that verbs are lowercase and everything else is uppercase. Assuming that is the case*, and that the subject ('John') and verb ('likes') are always a single word, and only the first word of the object ('Apple pie') is capitalized, this should work:
def pull_substrings(strings)
(strings.each_with_index
.select { |w,_| w[0] =~ /[a-z]/ }
.map { |_,i| i-1 } << strings.size)
.each_cons(2).map { |f,lp1| strings[f...lp1].join(' ') }
end
Let's try it:
strings = %w[John likes Hot Dogs Diana prefers Cupcakes ] +
%w[Billy-Bob devourers Hot Dogs Chips And Beer]
#=> ["John", "likes", "Hot", "Dogs",
# "Diana", "prefers", "Cupcakes",
# "Billy-Bob", "devourers", "Hot", "Dogs", "Chips", "And", "Beer"]
pull_substrings(strings)
#=> ["John likes Hot Dogs", "Diana prefers Cupcakes",
# "Billy-Bob devourers Hot Dogs Chips And Beer"]
Here's what's going on with the above array strings:
# Save each word with its index
a = strings.each_with_index
#=> #<Enumerator: ...>
a.to_a #=> [["John", 0], ["likes", 1], ["Hot", 2], ["Dogs", 3],
# ["Diana", 4], ["prefers", 5], ["Cupcakes", 6],
# ["Billy-Bob", 7], ["devourers", 8], ["Hot", 9], ["Dogs", 10],
# ["Chips", 11], ["And", 12], ["Beer", 13]]
# Locate the positions of the verbs
b = a.select { |w,_| w[0] =~ /[a-z]/ }
#=> [["likes", 1], ["prefers", 5], ["devourers", 8]]
# Convert to the locations of the subjects (offsets where strings begin)
c = b.map { |_,i| i-1 }
#=> [0, 4, 7]
# Add the position of the last word of the last substring plus 1
d = c << strings.size
#=> [0, 4, 7, 14]
# Look at each pair of subject offsets
e = (d).each_cons(2)
#=> #<Enumerator: ...>
e.to_a #=> [[0, 4], [4, 7], [7, 14]]
# Map each pair of offsets to a substring
e.map { |f,lp1| strings[f...lp1].join(' ') }
#=> ["John likes Hot Dogs",
# "Diana prefers Cupcakes",
# "Billy-Bob devourers Hot Dogs Chips And Beer"]
The first element of e passed to the block following map is `[0, 4], sof => 0, lp1 => 4` and
strings[0...4].join(' ') => ["John", "likes", "Hot", "Dogs"] => "John likes Hot Dogs"
I initially tried converting strings to a string, words separated with a space, and attempted to use a regex, but that was problematic.
Pun unintended

Related

How to sort only specific elements in an array?

I have an array of mixed elements, e.g. integers and strings:
ary = [3, "foo", 2, 5, "bar", 1, "baz", 4]
and I want to apply a sort but only to specific elements. In the above example, the elements to-be-sorted are the integers (but it could be anything). However, even though they should be sorted, they have to stay in their "integer spots". And the strings have to remain in their exact positions.
The sort should work like this:
[3, "foo", 2, 5, "bar", 1, "baz", 4] # before
[1, "foo", 2, 3, "bar", 4, "baz", 5] # after
"foo" is still at index 1, "bar" is at index 4 and "baz" is at index 6.
I could partition the array into integers and non-integers along with their positions:
a, b = ary.each_with_index.partition { |e, i| e.is_a?(Integer) }
a #=> [[3, 0], [2, 2], [5, 3], [1, 5], [4, 7]]
b #=> [["foo", 1], ["bar", 4], ["baz", 6]]
sort the integers:
result = a.map(&:first).sort
#=>[1, 2, 3, 4, 5]
And re-insert the non-integers at their original positions:
b.each { |e, i| result.insert(i, e) }
result
#=> [1, "foo", 2, 3, "bar", 4, "baz", 5]
But this seems rather clumsy. I particular dislike having to deconstruct and rebuild the array one string at a time. Is there a more elegant or more direct approach?
Possible solution
ary = [3, "foo", 2, 5, "bar", 1, "baz", 4]
integers = ary.select(&->(el) { el.is_a?(Integer) }).sort
ary.map { |n| n.is_a?(Integer) ? integers.shift : n }
# => [1, "foo", 2, 3, "bar", 4, "baz", 5]
I am not proficient with Ruby. Though I'd like to take a shot at what I could come up with.
The idea is as evoked in my comment.
get the indices of the integers
sort the values of the indices
insert the sorted values back into array
ary = [3, "foo", 2, 5, "bar", 1, "baz", 4]
=> [3, "foo", 2, 5, "bar", 1, "baz", 4]
indices = ary.map.with_index { |item,idx| idx if item.is_a?(Integer) }.compact
=> [0, 2, 3, 5, 7]
values = ary.values_at(*indices).sort
=> [1, 2, 3, 4, 5]
indices.zip(values).each { |idx, val| ary[idx]=val}
=> [[0, 1], [2, 2], [3, 3], [5, 4], [7, 5]]
ary
=> [1, "foo", 2, 3, "bar", 4, "baz", 5]
I have assumed that, as in the example, if arr = ary.dup and arr is modified ary is not mutated. If that is not the case one must work with a deep copy of ary.
A helper:
def sort_object?(e)
e.class == Integer
end
sort_object?(3)
#=> true
sort_object?(-3e2)
#=> true
sort_object?("foo")
#=> false
ary = [3, "foo", 2, 5, "bar", 1, "baz", 4]
obj_to_idx = ary.zip((0..ary.size-1).to_a).to_h
#=> {3=>0, "foo"=>1, 2=>2, 5=>3, "bar"=>4, 1=>5, "baz"=>6, 4=>7}
to_sort = ary.select { |k| sort_object?(k) }
#=> [3, 2, 5, 1, 4]
sorted = to_sort.sort
#=> [1, 2, 3, 4, 5]
new_idx_to_orig_idx = to_sort.map { |n| obj_to_idx[n] }
.zip(sorted.map { |n| obj_to_idx[n] })
.to_h
#=> {0=>5, 2=>2, 3=>0, 5=>7, 7=>3}
new_idx_to_orig_idx.each_with_object(ary.dup) do |(new_idx,orig_idx),a|
a[new_idx] = ary[orig_idx]
end
#=> [1, "foo", 2, 3, "bar", 4, "baz", 5]
Some of these statements may of course be chained if desired.
In-Place Re-Assignment at Designated Array Indices
You could certainly make this shorter, and perhaps even skip converting things to and from Hash objects, but the intermediate steps there are to show my thought process and make the intent more explicit and debugging a bit easier. Consider the following:
ary = [3, "foo", 2, 5, "bar", 1, "baz", 4]
ints = ary.each_with_index.select { |elem, idx| [elem, idx] if elem.kind_of? Integer }.to_h
ordered_ints = ints.keys.sort.zip(ints.values).to_h
ints.keys.sort.zip(ints.values).each { |elem, idx| ary[idx] = elem }
ary
#=> [1, "foo", 2, 3, "bar", 4, "baz", 5]
The idea here is that we:
Select just the items of the type we want to sort, along with their index within the current Array using #each_with_index. If you don't like #kind_of? you could replace it with #respond_to? or any other selection criteria that makes sense to you.
Sort Array values selected, and then #zip them up along with the index locations we're going to modify.
Re-assign the sorted elements for each index that needs to be replaced.
With this approach, all unselected elements remain in their original index locations within the ary Array. We're only modifying the values of specific Array indices with the re-ordered items.
I will throw my hat in this ring as well.
It appears the other answers rely on the assumption of uniqueness so I took a different route by building a positional transliteration Hash
(Update: took it a little further than necessary)
def sort_only(ary, sort_klass: Integer)
raise ArgumentError unless sort_klass.is_a?(Class) && sort_klass.respond_to?(:<=>)
# Construct a Hash of [Object,Position] => Object
translator = ary
.each_with_index
.with_object({}) {|a,obj| obj[a] = a.first}
.tap do |t|
# select the keys where the value (original element) is a sort_klass
t.filter_map {|k,v| k if v.is_a?(sort_klass)}
.then do |h|
# sort them and then remap these keys to point at the sorted value
h.sort.each_with_index do |(k,_),idx|
t[h[idx]] = k
end
end
end
# loop through the original array with its position index and use
# the transliteration Hash to place them in the correct order
ary.map.with_index {|a,i| translator[[a,i]]}
end
Example: (Working Example)
sort_only([3, "foo", 2, 5, "bar", 1, "baz", 4])
#=> [1, "foo", 2, 3, "bar", 4, "baz", 5]
sort_only([3, 3, "foo", 2, 5, "bar", 1, "baz",12.0,5,Object, 4,"foo", 7, -1, "qux"])
#=> [-1, 1, "foo", 2, 3, "bar", 3, "baz", 12.0, 4, Object, 5, "foo", 5, 7, "qux"]
sort_only([3, "foo", 2, 5,"qux", "bar", 1, "baz", 4], sort_klass: String)
#=> [3, "bar", 2, 5, "baz", "foo", 1, "qux", 4]
For Reference the Second Example produces the following transliteration:
{[3, 0]=>-1, [3, 1]=>1, ["foo", 2]=>"foo", [2, 3]=>2, [5, 4]=>3,
["bar", 5]=>"bar", [1, 6]=>3, ["baz", 7]=>"baz", [12.0, 8]=>12.0,
[5, 9]=>4, [Object, 10]=>Object, [4, 11]=>5, ["foo", 12]=>"foo",
[7, 13]=>5, [-1, 14]=>7, ["qux", 15]=>"qux"}
Below code simply creates two new arrays: numbers and others. numbers are Integer instances from original array, later sorted in descending order. others looks like original array but number spots replaced with nil. at the end push that sorted numbers to nil spots
arr.inject([[], []]) do |acc, el|
el.is_a?(Integer) ? (acc[0]<<el; acc[1]<<nil) : acc[1]<<el; acc
end.tap do |titself|
numbers = titself[0].sort_by {|e| -e }
titself[1].each_with_index do |u, i|
titself[1][i] = numbers.pop unless u
end
end.last
I'm going to answer my own question here with yet another approach to the problem. This solution delegates most work to Ruby's built-in methods and avoids explicit blocks / loops as far as possible.
Starting with the input array:
ary = [3, "foo", 2, 5, "bar", 1, "baz", 4]
You could build a hash of position => element pairs:
hash = ary.each_index.zip(ary).to_h
#=> {0=>3, 1=>"foo", 2=>2, 3=>5, 4=>"bar", 5=>1, 6=>"baz", 7=>4}
extract the pairs having integer value: (or whatever you want to sort)
ints_hash = hash.select { |k, v| v.is_a?(Integer) }
#=> {0=>3, 2=>2, 3=>5, 5=>1, 7=>4}
sort their values: (any way you want)
sorted_ints = ints_hash.values.sort
#=> [1, 2, 3, 4, 5]
build a new mapping for the sorted values:
sorted_ints_hash = ints_hash.keys.zip(sorted_ints).to_h
#=> {0=>1, 2=>2, 3=>3, 5=>4, 7=>5}
update the position hash:
hash.merge!(sorted_ints_hash)
#=> {0=>1, 1=>"foo", 2=>2, 3=>3, 4=>"bar", 5=>4, 6=>"baz", 7=>5}
And voilĂ :
hash.values
#=> [1, "foo", 2, 3, "bar", 4, "baz", 5]

How to find indices of identical sub-sequences in two strings in Ruby?

Here each instance of the class DNA corresponds to a string such as 'GCCCAC'. Arrays of substrings containing k-mers can be constructed from these strings. For this string there are 1-mers, 2-mers, 3-mers, 4-mers, 5-mers and one 6-mer:
6 1-mers: ["G", "C", "C", "C", "A", "C"]
5 2-mers: ["GC", "CC", "CC", "CA", "AC"]
4 3-mers: ["GCC", "CCC", "CCA", "CAC"]
3 4-mers: ["GCCC", "CCCA", "CCAC"]
2 5-mers: ["GCCCA", "CCCAC"]
1 6-mers: ["GCCCAC"]
The pattern should be evident. See the Wiki for details.
The problem is to write the method shared_kmers(k, dna2) of the DNA class which returns an array of all pairs [i, j] where this DNA object (that receives the message) shares with dna2 a common k-mer at position i in this dna and at position j in dna2.
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]
dna1.shared_kmers(4, dna2)
#=> [[2, 0]]
dna1.shared_kmers(5, dna2)
#=> []
class DNA
attr_accessor :sequencing
def initialize(sequencing)
#sequencing = sequencing
end
def kmers(k)
#sequencing.each_char.each_cons(k).map(&:join)
end
def shared_kmers(k, dna)
kmers(k).each_with_object([]).with_index do |(kmer, result), index|
dna.kmers(k).each_with_index do |other_kmer, other_kmer_index|
result << [index, other_kmer_index] if kmer.eql?(other_kmer)
end
end
end
end
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.kmers(2)
#=> ["GC", "CC", "CC", "CA", "AC"]
dna2.kmers(2)
#=> ["CC", "CA", "AC", "CG", "GC"]
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]
dna1.shared_kmers(4, dna2)
#=> [[2, 0]]
dna1.shared_kmers(5, dna2)
#=> []
I will address the crux of your problem only, without reference to a class DNA. It should be easy to reorganize what follows quite easily.
Code
def match_kmers(s1, s2, k)
h1 = dna_to_index(s1, k)
h2 = dna_to_index(s2, k)
h1.flat_map { |k,_| h1[k].product(h2[k] || []) }
end
def dna_to_index(dna, k)
dna.each_char.
with_index.
each_cons(k).
with_object({}) {|arr,h| (h[arr.map(&:first).join] ||= []) << arr.first.last}
end
Examples
dna1 = 'GCCCAC'
dna2 = 'CCACGC'
match_kmers(dna1, dna2, 2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
match_kmers(dna2, dna1, 2)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
match_kmers(dna1, dna2, 3)
#=> [[2, 0], [3, 1]]
match_kmers(dna2, dna1, 3)
#=> [[0, 2], [1, 3]]
match_kmers(dna1, dna2, 4)
#=> [[2, 0]]
match_kmers(dna2, dna1, 4)
#=> [[0, 2]]
match_kmers(dna1, dna2, 5)
#=> []
match_kmers(dna2, dna1, 5)
#=> []
match_kmers(dna1, dna2, 6)
#=> []
match_kmers(dna2, dna1, 6)
#=> []
Explanation
Consider dna1 = 'GCCCAC'. This contains 5 2-mers (k = 2):
dna1.each_char.each_cons(2).to_a.map(&:join)
#=> ["GC", "CC", "CC", "CA", "AC"]
Similarly, for dna2 = 'CCACGC':
dna2.each_char.each_cons(2).to_a.map(&:join)
#=> ["CC", "CA", "AC", "CG", "GC"]
These are the keys of the hashes produced by dna_to_index for dna1 and dna2, respectively. The hash values are arrays of indices of where the corresponding key begins in the DNA string. Let's compute those hashes for k = 2:
h1 = dna_to_index(dna1, 2)
#=> {"GC"=>[0], "CC"=>[1, 2], "CA"=>[3], "AC"=>[4]}
h2 = dna_to_index(dna2, 2)
#=> {"CC"=>[0], "CA"=>[1], "AC"=>[2], "CG"=>[3], "GC"=>[4]}
h1 shows that:
"GC" begins at index 0 of dna1
"CC" begins at indices 1 and 2 of dna1
"CA" begins at index 3 of dna1
"CC" begins at index 4 of dna1
h2 has a similar interpretation. See Enumerable#flat_map and Array#product.
The method match_kmers is then used to construct the desired array of pairs of indices [i, j] such that h1[i] = h2[j].
Now let's look at the hashes produced for 3-mers (k = 3):
h1 = dna_to_index(dna1, 3)
#=> {"GCC"=>[0], "CCC"=>[1], "CCA"=>[2], "CAC"=>[3]}
h2 = dna_to_index(dna2, 3)
#=> {"CCA"=>[0], "CAC"=>[1], "ACG"=>[2], "CGC"=>[3]}
We see that the first 3-mer in dna1 is "GCC", beginning at index 0. This 3-mer does not appear in dna2, however, so there are no elements [0, X] in the array returned (X being just a placeholder). Nor is "CCC" a key in the second hash. "CCA" and "CAC" are present in the second hash, however, so the array returned is:
h1["CCA"].product(h2["CCA"]) + h1["CAC"].product(h2["CAC"])
#=> [[2, 0]] + [[3, 1]]
#=> [[2, 0], [3, 1]]
I would start by writing a method to enumerate subsequences of a given length (i.e. the k-mers):
class DNA
def initialize(sequence)
#sequence = sequence
end
def each_kmer(length)
return enum_for(:each_kmer, length) unless block_given?
0.upto(#sequence.length - length) { |i| yield #sequence[i, length] }
end
end
DNA.new('GCCCAC').each_kmer(2).to_a
#=> ["GC", "CC", "CC", "CA", "AC"]
On top of this you can easily collect the indices of identical k-mers using a nested loop:
class DNA
# ...
def shared_kmers(length, other)
indices = []
each_kmer(length).with_index do |k, i|
other.each_kmer(length).with_index do |l, j|
indices << [i, j] if k == l
end
end
indices
end
end
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
Unfortunately, the above code traverses other.each_kmer for each k-mer in the receiver. We can optimize this by building a hash containing all indices for each k-mer in other up-front:
class DNA
# ...
def shared_kmers(length, other)
hash = Hash.new { |h, k| h[k] = [] }
other.each_kmer(length).with_index { |k, i| hash[k] << i }
indices = []
each_kmer(length).with_index do |k, i|
hash[k].each { |j| indices << [i, j] }
end
indices
end
end

How to use collect and include for multidimensional array

I have:
array1 = [[1,2,3,4,5],[7,8,9,10],[11,12,13,14]]
#student_ids = [1,2,3]
I want to replace elements in array1 that are included in #student_ids with 'X'. I want to see:
[['X','X','X',4,5],[7,8,9,10],[11,12,13,14]]
I have code that is intended to do this:
array1.collect! do |i|
if i.include?(#student_ids) #
i[i.index(#student_ids)] = 'X'; i # I want to replace all with X
else
i
end
end
If #student_ids is 1, then it works, but if #student_ids has more than one element such as 1,2,3, it raises errors. Any help?
It's faster to use a hash or a set than to repeatedly test [1,2,3].include?(n).
arr = [[1,2,3,4,5],[7,8,9,10],[11,12,13,14]]
ids = [1,2,3]
Use a hash
h = ids.product(["X"]).to_h
#=> {1=>"X", 2=>"X", 3=>"X"}
arr.map { |a| a.map { |n| h.fetch(n, n) } }
#=> [["X", "X", "X", 4, 5], [7, 8, 9, 10], [11, 12, 13, 14]]
See Hash#fetch.
Use a set
require 'set'
ids = ids.to_set
#=> #<Set: {1, 2, 3}>
arr.map { |a| a.map { |n| ids.include?(n) ? "X" : n } }
#=> [["X", "X", "X", 4, 5], [7, 8, 9, 10], [11, 12, 13, 14]]
Replace both maps with map! if the array is to be modified in place (mutated).
Try following, (taking #student_ids = [1, 2, 3])
array1.inject([]) { |m,a| m << a.map { |x| #student_ids.include?(x) ? 'X' : x } }
# => [["X", "X", "X", 4, 5], [7, 8, 9, 10], [11, 12, 13, 14]]
You can use each_with_index and replace the item you want:
array1 = [[1,2,3,4,5],[7,8,9,10],[11,12,13,14]]
#student_ids = [1,2,3]
array1.each_with_index do |sub_array, index|
sub_array.each_with_index do |item, index2|
array1[index][index2] = 'X' if #student_ids.include?(item)
end
end
You can do the following:
def remove_student_ids(arr)
arr.each_with_index do |value, index|
arr[index] = 'X' if #student_ids.include?(value) }
end
end
array1.map{ |sub_arr| remove_student_ids(sub_arr)}

Ruby: How to find the most frequent substring of length n? [duplicate]

I have this program with a class DNA. The program counts the most frequent k-mer in a string. So, it is looking for the most common substring in a string with a length of k.
An example would be creating a dna1 object with a string of AACCAATCCG. The count k-mer method will look for a subtring with a length of k and output the most common answer. So, if we set k = 1 then 'A' and 'C' will be the most occurrence in the string because it appears four times. See example below:
dna1 = DNA.new('AACCAATCCG')
=> AACCAATCCG
>> dna1.count_kmer(1)
=> [#<Set: {"A", "C"}>, 4]
>> dna1.count_kmer(2)
=> [#<Set: {"AA", "CC"}>, 2]
Here is my DNA class :
class DNA
def initialize (nucleotide)
#nucleotide = nucleotide
end
def length
#nucleotide.length
end
protected
attr_reader :nucleotide
end
Here is my count kmer method that I am trying to implement:
# I have k as my only parameter because I want to pass the nucleotide string in the method
def count_kmer(k)
# I created an array as it seems like a good way to split up the nucleotide string.
counts = []
#this tries to count how many kmers of length k there are
num_kmers = self.nucleotide.length- k + 1
#this should try and look over the kmer start positions
for i in num_kmers
#Slice the string, so that way we can get the kmer
kmer = self.nucleotide.split('')
end
#add kmer if its not present
if !kmer = counts
counts[kmer] = 0
#increment the count for kmer
counts[kmer] +=1
end
#return the final count
return counts
end
#end dna class
end
I'm not sure where my method went wrong.
Something like this?
require 'set'
def count_kmer(k)
max_kmers = kmers(k)
.each_with_object(Hash.new(0)) { |value, count| count[value] += 1 }
.group_by { |_,v| v }
.max
[Set.new(max_kmers[1].map { |e| e[0] }), max_kmers[0]]
end
def kmers(k)
nucleotide.chars.each_cons(k).map(&:join)
end
EDIT: Here's the full text of the class:
require 'set'
class DNA
def initialize (nucleotide)
#nucleotide = nucleotide
end
def length
#nucleotide.length
end
def count_kmer(k)
max_kmers = kmers(k)
.each_with_object(Hash.new(0)) { |value, count| count[value] += 1 }
.group_by { |_,v| v }
.max
[Set.new(max_kmers[1].map { |e| e[0] }), max_kmers[0]]
end
def kmers(k)
nucleotide.chars.each_cons(k).map(&:join)
end
protected
attr_reader :nucleotide
end
This produces the following output, using Ruby 2.2.1, using the class and method you specified:
>> dna1 = DNA.new('AACCAATCCG')
=> #<DNA:0x007fe15205bc30 #nucleotide="AACCAATCCG">
>> dna1.count_kmer(1)
=> [#<Set: {"A", "C"}>, 4]
>> dna1.count_kmer(2)
=> [#<Set: {"AA", "CC"}>, 2]
As a bonus, you can also do:
>> dna1.kmers(2)
=> ["AA", "AC", "CC", "CA", "AA", "AT", "TC", "CC", "CG"]
Code
def most_frequent_substrings(str, k)
(0..str.size-k).each_with_object({}) do |i,h|
b = []
str[i..-1].scan(Regexp.new str[i,k]) { b << Regexp.last_match.begin(0) + i }
(h[b.size] ||= []) << b
end.max_by(&:first).last.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
end
Example
str = "ABBABABBABCATSABBABB"
most_frequent_substrings(str, 4)
#=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}
This shows that the most frequently-occurring 4-character substring of strappears 3 times. There are two such substrings: "ABBA" and "BBAB". "ABBA" begins at offsets (into str) 0, 5 and 14, "BBAB" substrings begin at offsets 1, 6 and 15.
Explanation
For the example above the steps are as follows.
k = 4
n = str.size - k
#=> 20 - 4 => 16
e = (0..n).each_with_object([])
#<Enumerator: 0..16:each_with_object([])>
We can see the values that will be generated by this enumerator by converting it to an array.
e.to_a
#=> [[0, []], [1, []], [2, []], [3, []], [4, []], [5, []], [6, []], [7, []], [8, []],
# [9, []], [10, []], [11, []], [12, []], [13, []], [14, []], [15, []], [16, []]]
Note the empty array contained in each element will be modified as the array is built. Continuing, the first element of e is passed to the block and the block variables are assigned using parallel assignment:
i,a = e.next
#=> [0, []]
i #=> 0
a #=> []
We are now considering the substring of size 4 that begins at str offset i #=> 0, which is seen to be "ABBA". Now the block calculation is performed.
b = []
r = Regexp.new str[i,k]
#=> Regexp.new str[0,4]
#=> Regexp.new "ABBA"
#=> /ABAB/
str[i..-1].scan(r) { b << Regexp.last_match.begin(0) + i }
#=> "ABBABABBABCATSABBABB".scan(r) { b << Regexp.last_match.begin(0) + i }
b #=> [0, 5, 14]
We next have
(h[b.size] ||= []) << b
which becomes
(h[b.size] = h[b.size] || []) << b
#=> (h[3] = h[3] || []) << [0, 5, 14]
Since h has no key 3, h[3] on the right side equals nil. Continuing,
#=> (h[3] = nil || []) << [0, 5, 14]
#=> (h[3] = []) << [0, 5, 14]
h #=> { 3=>[[0, 5, 14]] }
Notice that we throw away scan's return value. All we need is b
This tells us the "ABBA" appears thrice in str, beginning at offsets 0, 5 and 14.
Now observe
e.to_a
#=> [[0, [[0, 5, 14]]], [1, [[0, 5, 14]]], [2, [[0, 5, 14]]],
# ...
# [16, [[0, 5, 14]]]]
After all elements of e have been passed to the block, the block returns
h #=> {3=>[[0, 5, 14], [1, 6, 15]],
# 1=>[[2], [3], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]],
# 2=>[[4, 16], [5, 14], [6, 15]]}
Consider substrings that appear just once: h[1]. One of those is [2]. This pertains to the 4-character substring beginning at str offset 2:
str[2,4]
#=> "BABA"
That is found to be the only instance of that substring. Similarly, among the substrings that appear twice is str[4,4] = str[16,4] #=> "BABB", given by h[2][0] #=> [4, 16].
Next we determine the greatest frequency of a substring of length 4:
c = h.max_by(&:first)
#=> [3, [[0, 5, 14], [1, 6, 15]]]
(which could also be written c = h.max_by { |k,_| k }).
d = c.last
#=> [[0, 5, 14], [1, 6, 15]]
For convenience, convert d to a hash:
d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
#=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}
and return that hash from the method.
There is one detail that deserves mention. It is possible that d will contain two or more arrays that reference the same substring, in which case the value of the associated key (the substring) will equal the last of those arrays. Here's a simple example.
str = "AAA"
k = 2
In this case the array d above will equal
d = [[0], [1]]
Both of these reference str[0,2] #=> str[1,2] #=> "AA". In building the hash the first is overwritten by the second:
d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
#=> {"AA"=>[1]}

How to write a method that counts the most common substring in a string in ruby?

I have this program with a class DNA. The program counts the most frequent k-mer in a string. So, it is looking for the most common substring in a string with a length of k.
An example would be creating a dna1 object with a string of AACCAATCCG. The count k-mer method will look for a subtring with a length of k and output the most common answer. So, if we set k = 1 then 'A' and 'C' will be the most occurrence in the string because it appears four times. See example below:
dna1 = DNA.new('AACCAATCCG')
=> AACCAATCCG
>> dna1.count_kmer(1)
=> [#<Set: {"A", "C"}>, 4]
>> dna1.count_kmer(2)
=> [#<Set: {"AA", "CC"}>, 2]
Here is my DNA class :
class DNA
def initialize (nucleotide)
#nucleotide = nucleotide
end
def length
#nucleotide.length
end
protected
attr_reader :nucleotide
end
Here is my count kmer method that I am trying to implement:
# I have k as my only parameter because I want to pass the nucleotide string in the method
def count_kmer(k)
# I created an array as it seems like a good way to split up the nucleotide string.
counts = []
#this tries to count how many kmers of length k there are
num_kmers = self.nucleotide.length- k + 1
#this should try and look over the kmer start positions
for i in num_kmers
#Slice the string, so that way we can get the kmer
kmer = self.nucleotide.split('')
end
#add kmer if its not present
if !kmer = counts
counts[kmer] = 0
#increment the count for kmer
counts[kmer] +=1
end
#return the final count
return counts
end
#end dna class
end
I'm not sure where my method went wrong.
Something like this?
require 'set'
def count_kmer(k)
max_kmers = kmers(k)
.each_with_object(Hash.new(0)) { |value, count| count[value] += 1 }
.group_by { |_,v| v }
.max
[Set.new(max_kmers[1].map { |e| e[0] }), max_kmers[0]]
end
def kmers(k)
nucleotide.chars.each_cons(k).map(&:join)
end
EDIT: Here's the full text of the class:
require 'set'
class DNA
def initialize (nucleotide)
#nucleotide = nucleotide
end
def length
#nucleotide.length
end
def count_kmer(k)
max_kmers = kmers(k)
.each_with_object(Hash.new(0)) { |value, count| count[value] += 1 }
.group_by { |_,v| v }
.max
[Set.new(max_kmers[1].map { |e| e[0] }), max_kmers[0]]
end
def kmers(k)
nucleotide.chars.each_cons(k).map(&:join)
end
protected
attr_reader :nucleotide
end
This produces the following output, using Ruby 2.2.1, using the class and method you specified:
>> dna1 = DNA.new('AACCAATCCG')
=> #<DNA:0x007fe15205bc30 #nucleotide="AACCAATCCG">
>> dna1.count_kmer(1)
=> [#<Set: {"A", "C"}>, 4]
>> dna1.count_kmer(2)
=> [#<Set: {"AA", "CC"}>, 2]
As a bonus, you can also do:
>> dna1.kmers(2)
=> ["AA", "AC", "CC", "CA", "AA", "AT", "TC", "CC", "CG"]
Code
def most_frequent_substrings(str, k)
(0..str.size-k).each_with_object({}) do |i,h|
b = []
str[i..-1].scan(Regexp.new str[i,k]) { b << Regexp.last_match.begin(0) + i }
(h[b.size] ||= []) << b
end.max_by(&:first).last.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
end
Example
str = "ABBABABBABCATSABBABB"
most_frequent_substrings(str, 4)
#=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}
This shows that the most frequently-occurring 4-character substring of strappears 3 times. There are two such substrings: "ABBA" and "BBAB". "ABBA" begins at offsets (into str) 0, 5 and 14, "BBAB" substrings begin at offsets 1, 6 and 15.
Explanation
For the example above the steps are as follows.
k = 4
n = str.size - k
#=> 20 - 4 => 16
e = (0..n).each_with_object([])
#<Enumerator: 0..16:each_with_object([])>
We can see the values that will be generated by this enumerator by converting it to an array.
e.to_a
#=> [[0, []], [1, []], [2, []], [3, []], [4, []], [5, []], [6, []], [7, []], [8, []],
# [9, []], [10, []], [11, []], [12, []], [13, []], [14, []], [15, []], [16, []]]
Note the empty array contained in each element will be modified as the array is built. Continuing, the first element of e is passed to the block and the block variables are assigned using parallel assignment:
i,a = e.next
#=> [0, []]
i #=> 0
a #=> []
We are now considering the substring of size 4 that begins at str offset i #=> 0, which is seen to be "ABBA". Now the block calculation is performed.
b = []
r = Regexp.new str[i,k]
#=> Regexp.new str[0,4]
#=> Regexp.new "ABBA"
#=> /ABAB/
str[i..-1].scan(r) { b << Regexp.last_match.begin(0) + i }
#=> "ABBABABBABCATSABBABB".scan(r) { b << Regexp.last_match.begin(0) + i }
b #=> [0, 5, 14]
We next have
(h[b.size] ||= []) << b
which becomes
(h[b.size] = h[b.size] || []) << b
#=> (h[3] = h[3] || []) << [0, 5, 14]
Since h has no key 3, h[3] on the right side equals nil. Continuing,
#=> (h[3] = nil || []) << [0, 5, 14]
#=> (h[3] = []) << [0, 5, 14]
h #=> { 3=>[[0, 5, 14]] }
Notice that we throw away scan's return value. All we need is b
This tells us the "ABBA" appears thrice in str, beginning at offsets 0, 5 and 14.
Now observe
e.to_a
#=> [[0, [[0, 5, 14]]], [1, [[0, 5, 14]]], [2, [[0, 5, 14]]],
# ...
# [16, [[0, 5, 14]]]]
After all elements of e have been passed to the block, the block returns
h #=> {3=>[[0, 5, 14], [1, 6, 15]],
# 1=>[[2], [3], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]],
# 2=>[[4, 16], [5, 14], [6, 15]]}
Consider substrings that appear just once: h[1]. One of those is [2]. This pertains to the 4-character substring beginning at str offset 2:
str[2,4]
#=> "BABA"
That is found to be the only instance of that substring. Similarly, among the substrings that appear twice is str[4,4] = str[16,4] #=> "BABB", given by h[2][0] #=> [4, 16].
Next we determine the greatest frequency of a substring of length 4:
c = h.max_by(&:first)
#=> [3, [[0, 5, 14], [1, 6, 15]]]
(which could also be written c = h.max_by { |k,_| k }).
d = c.last
#=> [[0, 5, 14], [1, 6, 15]]
For convenience, convert d to a hash:
d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
#=> {"ABBA"=>[0, 5, 14], "BBAB"=>[1, 6, 15]}
and return that hash from the method.
There is one detail that deserves mention. It is possible that d will contain two or more arrays that reference the same substring, in which case the value of the associated key (the substring) will equal the last of those arrays. Here's a simple example.
str = "AAA"
k = 2
In this case the array d above will equal
d = [[0], [1]]
Both of these reference str[0,2] #=> str[1,2] #=> "AA". In building the hash the first is overwritten by the second:
d.each_with_object({}) { |a,h| h[str[a.first,k]] = a }
#=> {"AA"=>[1]}

Resources