How to find indices of identical sub-sequences in two strings in Ruby?

Here each instance of the class DNA corresponds to a string such as 'GCCCAC'. Arrays of substrings containing k-mers can be constructed from these strings. For this string there are 1-mers, 2-mers, 3-mers, 4-mers, 5-mers and one 6-mer:
6 1-mers: ["G", "C", "C", "C", "A", "C"]
5 2-mers: ["GC", "CC", "CC", "CA", "AC"]
4 3-mers: ["GCC", "CCC", "CCA", "CAC"]
3 4-mers: ["GCCC", "CCCA", "CCAC"]
2 5-mers: ["GCCCA", "CCCAC"]
1 6-mers: ["GCCCAC"]
The pattern should be evident. See the Wiki for details.
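For reference, the listing above can be reproduced with a few lines of Ruby. This is only a sketch: `kmers_of` is a hypothetical helper name, not something from the question. A string of length n has n - k + 1 k-mers.

```ruby
# List every k-mer of a string for k = 1..length.
# `kmers_of` is a hypothetical helper name used only for illustration.
def kmers_of(str, k)
  # each_cons(k) slides a window of k characters over the string
  str.each_char.each_cons(k).map(&:join)
end

str = 'GCCCAC'
(1..str.length).each do |k|
  list = kmers_of(str, k)
  puts "#{list.size} #{k}-mers: #{list.inspect}"
end
```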
The problem is to write the method shared_kmers(k, dna2) of the DNA class which returns an array of all pairs [i, j] where this DNA object (that receives the message) shares with dna2 a common k-mer at position i in this dna and at position j in dna2.
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]
dna1.shared_kmers(4, dna2)
#=> [[2, 0]]
dna1.shared_kmers(5, dna2)
#=> []

class DNA
attr_accessor :sequencing
def initialize(sequencing)
@sequencing = sequencing
end
def kmers(k)
@sequencing.each_char.each_cons(k).map(&:join)
end
def shared_kmers(k, dna)
kmers(k).each_with_object([]).with_index do |(kmer, result), index|
dna.kmers(k).each_with_index do |other_kmer, other_kmer_index|
result << [index, other_kmer_index] if kmer.eql?(other_kmer)
end
end
end
end
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.kmers(2)
#=> ["GC", "CC", "CC", "CA", "AC"]
dna2.kmers(2)
#=> ["CC", "CA", "AC", "CG", "GC"]
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]
dna1.shared_kmers(4, dna2)
#=> [[2, 0]]
dna1.shared_kmers(5, dna2)
#=> []

I will address only the crux of your problem, without reference to a class DNA. It should be easy to reorganize what follows into such a class.
Code
def match_kmers(s1, s2, k)
h1 = dna_to_index(s1, k)
h2 = dna_to_index(s2, k)
h1.flat_map { |k,_| h1[k].product(h2[k] || []) }
end
def dna_to_index(dna, k)
dna.each_char.
with_index.
each_cons(k).
with_object({}) {|arr,h| (h[arr.map(&:first).join] ||= []) << arr.first.last}
end
Examples
dna1 = 'GCCCAC'
dna2 = 'CCACGC'
match_kmers(dna1, dna2, 2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
match_kmers(dna2, dna1, 2)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
match_kmers(dna1, dna2, 3)
#=> [[2, 0], [3, 1]]
match_kmers(dna2, dna1, 3)
#=> [[0, 2], [1, 3]]
match_kmers(dna1, dna2, 4)
#=> [[2, 0]]
match_kmers(dna2, dna1, 4)
#=> [[0, 2]]
match_kmers(dna1, dna2, 5)
#=> []
match_kmers(dna2, dna1, 5)
#=> []
match_kmers(dna1, dna2, 6)
#=> []
match_kmers(dna2, dna1, 6)
#=> []
Explanation
Consider dna1 = 'GCCCAC'. This contains 5 2-mers (k = 2):
dna1.each_char.each_cons(2).to_a.map(&:join)
#=> ["GC", "CC", "CC", "CA", "AC"]
Similarly, for dna2 = 'CCACGC':
dna2.each_char.each_cons(2).to_a.map(&:join)
#=> ["CC", "CA", "AC", "CG", "GC"]
These are the keys of the hashes produced by dna_to_index for dna1 and dna2, respectively. The hash values are arrays of indices of where the corresponding key begins in the DNA string. Let's compute those hashes for k = 2:
h1 = dna_to_index(dna1, 2)
#=> {"GC"=>[0], "CC"=>[1, 2], "CA"=>[3], "AC"=>[4]}
h2 = dna_to_index(dna2, 2)
#=> {"CC"=>[0], "CA"=>[1], "AC"=>[2], "CG"=>[3], "GC"=>[4]}
h1 shows that:
"GC" begins at index 0 of dna1
"CC" begins at indices 1 and 2 of dna1
"CA" begins at index 3 of dna1
"AC" begins at index 4 of dna1
h2 has a similar interpretation. See Enumerable#flat_map and Array#product.
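The method match_kmers then constructs the desired array of index pairs [i, j] such that the k-mer beginning at index i of dna1 equals the k-mer beginning at index j of dna2. For each k-mer key in h1 it pairs every index in h1[key] with every index in h2[key].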
Now let's look at the hashes produced for 3-mers (k = 3):
h1 = dna_to_index(dna1, 3)
#=> {"GCC"=>[0], "CCC"=>[1], "CCA"=>[2], "CAC"=>[3]}
h2 = dna_to_index(dna2, 3)
#=> {"CCA"=>[0], "CAC"=>[1], "ACG"=>[2], "CGC"=>[3]}
We see that the first 3-mer in dna1 is "GCC", beginning at index 0. This 3-mer does not appear in dna2, however, so there are no elements [0, X] in the array returned (X being just a placeholder). Nor is "CCC" a key in the second hash. "CCA" and "CAC" are present in the second hash, however, so the array returned is:
h1["CCA"].product(h2["CCA"]) + h1["CAC"].product(h2["CAC"])
#=> [[2, 0]] + [[3, 1]]
#=> [[2, 0], [3, 1]]
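If you do want this inside a class, one possible reorganization is sketched below. The names `sequence` and `kmer_index` are my own assumptions, not part of the question; only `shared_kmers(k, dna2)` comes from the problem statement.

```ruby
class DNA
  attr_reader :sequence   # assumed attribute name

  def initialize(sequence)
    @sequence = sequence
  end

  # Map each k-mer to the list of indices at which it begins.
  # `kmer_index` is a hypothetical helper name.
  def kmer_index(k)
    sequence.each_char.each_cons(k).with_index.with_object({}) do |(chars, i), h|
      (h[chars.join] ||= []) << i
    end
  end

  # Pair every index of a k-mer in the receiver with every index
  # of the same k-mer in `other`.
  def shared_kmers(k, other)
    theirs = other.kmer_index(k)
    kmer_index(k).flat_map { |kmer, mine| mine.product(theirs.fetch(kmer, [])) }
  end
end

dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
p dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
```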

I would start by writing a method to enumerate subsequences of a given length (i.e. the k-mers):
class DNA
def initialize(sequence)
@sequence = sequence
end
def each_kmer(length)
return enum_for(:each_kmer, length) unless block_given?
0.upto(@sequence.length - length) { |i| yield @sequence[i, length] }
end
end
DNA.new('GCCCAC').each_kmer(2).to_a
#=> ["GC", "CC", "CC", "CA", "AC"]
On top of this you can easily collect the indices of identical k-mers using a nested loop:
class DNA
# ...
def shared_kmers(length, other)
indices = []
each_kmer(length).with_index do |k, i|
other.each_kmer(length).with_index do |l, j|
indices << [i, j] if k == l
end
end
indices
end
end
dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')
dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
Unfortunately, the above code traverses other.each_kmer for each k-mer in the receiver. We can optimize this by building a hash containing all indices for each k-mer in other up-front:
class DNA
# ...
def shared_kmers(length, other)
hash = Hash.new { |h, k| h[k] = [] }
other.each_kmer(length).with_index { |k, i| hash[k] << i }
indices = []
each_kmer(length).with_index do |k, i|
hash[k].each { |j| indices << [i, j] }
end
indices
end
end
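To convince yourself that the hash-based version returns the same pairs in the same order as the nested loop, you could compare the two on random input. The sketch below uses hypothetical standalone helpers (`kmers`, `naive_pairs`, `hashed_pairs`) that work on plain strings rather than DNA objects; they mirror the two approaches but are not the code above.

```ruby
# Hypothetical standalone helpers mirroring the two shared_kmers versions.
def kmers(str, k)
  (0..(str.length - k)).map { |i| str[i, k] }
end

# Nested loop: compares every k-mer of s1 with every k-mer of s2.
def naive_pairs(s1, s2, k)
  pairs = []
  kmers(s1, k).each_with_index do |a, i|
    kmers(s2, k).each_with_index do |b, j|
      pairs << [i, j] if a == b
    end
  end
  pairs
end

# Hash lookup: index s2 once, then look up each k-mer of s1.
def hashed_pairs(s1, s2, k)
  index = Hash.new { |h, key| h[key] = [] }
  kmers(s2, k).each_with_index { |b, j| index[b] << j }
  kmers(s1, k).each_with_index.flat_map { |a, i| index[a].map { |j| [i, j] } }
end

rng = Random.new(42)
s1 = Array.new(200) { %w[A C G T].sample(random: rng) }.join
s2 = Array.new(200) { %w[A C G T].sample(random: rng) }.join

(1..6).each do |k|
  same = naive_pairs(s1, s2, k) == hashed_pairs(s1, s2, k)
  puts "k=#{k}: #{same ? 'identical' : 'different'}"
end
```

The nested loop performs one comparison per pair of k-mers, whereas the hash version touches each k-mer once and then only does work proportional to the number of matches.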

Related

Ruby, remove super-arrays

If I have an array of arrays, A, and want to get rid of every array in A that also has a sub-array in A, how would I do that? In this context, array_1 is a sub-array of array_2 if array_1 - array_2 = []. In the case that multiple arrays are simply rearranged versions of the same elements, bonus points if you can get rid of all but one of them, but you can handle this however you want if it's easier.
In python, I could easily use comprehension, with A being a set of frozen sets :
A = {a for a in A if all(b-a for b in A-{a})}
Is there a simple way to write this in Ruby? I don't care whether the order of A or its arrays is preserved at all. Also, in my program, none of the arrays have duplicate elements, if that makes things any easier/faster.
Example
A = [[1,6],[1,2],[2,4],[3,5],[1,3,6],[2,3,6]]
# [1,6] is a subarray of [1,3,6], so [1,3,6] should be removed
remove_super_arrays(A)
> A = [[1,6],[1,2],[2,4],[3,5],[2,3,6]]
A = [[1,2,4],[2,3,4],[1,4,5],[2,6]]
# although there is overlap, there are no subarrays, so nothing should be removed
remove_super_arrays(A)
> A = [[1,2,4],[2,3,4],[1,4,5],[2,6]]
A = [[1],[2,1,3],[2,4],[1,4]]
# [1] is a subarray of [2,1,3] and [1,4]
remove_super_arrays(A)
> A = [[1],[2,4]]
Code
def remove_super_arrays(arr)
order = arr.each_with_index.to_a.to_h
# compare each array against the shorter arrays that precede it in size order
sorted = arr.sort_by(&:size)
sorted.reject.with_index do |a,i|
sorted[0,i].any? { |aa| (aa.size < a.size) && (aa-a).empty? }
end.sort_by { |a| order[a] }
end
Examples
remove_super_arrays([[1,6],[1,2],[2,4],[3,5],[1,3,6],[2,3,6]] )
#=> [[1,6],[1,2],[2,4],[3,5],[2,3,6]]
remove_super_arrays([[1,2,4],[2,3,4],[1,4,5],[2,6]])
#=> [[1,2,4],[2,3,4],[1,4,5],[2,6]]
remove_super_arrays([[1],[2,1,3],[2,4],[1,4]])
#=> [[1],[2,4]]
Explanation
Consider the first example.
arr = [[1,6],[1,2],[2,4],[3,5],[1,3,6],[2,3,6]]
We first save the positions of the elements of arr:
order = arr.each_with_index.to_a.to_h # save original order
#=> {[1, 6]=>0, [1, 2]=>1, [2, 4]=>2, [3, 5]=>3, [1, 3, 6]=>4, [2, 3, 6]=>5}
Then reject elements of arr:
b = arr.sort_by(&:size)
#=> [[1, 6], [1, 2], [2, 4], [3, 5], [1, 3, 6], [2, 3, 6]]
c = b.reject.with_index do |a,i|
b[0,i].any? { |aa| (aa.size < a.size) && (aa-a).empty? }
end
#=> [[1, 6], [1, 2], [2, 4], [3, 5], [2, 3, 6]]
Lastly, reorder c to correspond to the original ordering of the elements of arr.
c.sort_by { |a| order[a] }
#=> [[1, 6], [1, 2], [2, 4], [3, 5], [2, 3, 6]]
which in this case happens to be the same order as the elements of c.
Let's look more carefully at the calculation of c:
enum1 = b.reject
#=> #<Enumerator: [[1, 6], [1, 2], [2, 4], [3, 5], [1, 3, 6],
# [2, 3, 6]]:reject>
enum2 = enum1.with_index
#=> #<Enumerator: #<Enumerator: [[1, 6], [1, 2], [2, 4], [3, 5],
# [1, 3, 6], [2, 3, 6]]:reject>:with_index>
The first element generated by the enumerator enum2 is passed to the block, and its values are assigned to the block variables:
a, i = enum2.next
#=> [[1, 6], 0]
a #=> [1, 6]
i #=> 0
The block calculation is then performed:
d = b[0,i]
#=> []
d.any? { |aa| (aa.size < a.size) && (aa-a).empty? }
#=> false
so a ([1, 6]) is not rejected. The next pair passed to the block by enum2 is [[1, 2], 1]. That value is retained as well, but let's skip ahead to the fifth element passed to the block by enum2:
a, i = enum2.next
#=> [[1, 2], 1]
a, i = enum2.next
#=> [[2, 4], 2]
a, i = enum2.next
#=> [[3, 5], 3]
a, i = enum2.next
#=> [[1, 3, 6], 4]
a #=> [1, 3, 6]
i #=> 4
Perform the block calculation:
d = b[0,i]
#=> [[1, 6], [1, 2], [2, 4], [3, 5]]
d.any? { |aa| (aa.size < a.size) && (aa-a).empty? }
#=> true
As true is returned, a is rejected. In the last calculation the first element of d is passed to the block and the following calculation is performed:
aa = [1, 6]
(aa.size < a.size)
#=> 2 < 3 => true
(aa-a).empty?
#=> ([1, 6] - [1, 3, 6]).empty? => [].empty? => true
As true && true #=> true, a ([1, 3, 6]) is rejected.
Alternative calculation
The following is a closer match to the OP's Python equivalent, but less efficient:
def remove_super_arrays(arr)
arr.select do |a|
(arr-[a]).all? { |aa| aa.size > a.size || (aa-a).any? }
end
end
or
def remove_super_arrays(arr)
arr.reject do |a|
(arr-[a]).any? { |aa| (aa.size < a.size) && (aa-a).empty? }
end
end
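Since the Python version works on a set of frozensets, another option is to do the same with Ruby's Set. The following is only a sketch under that assumption; `remove_super_sets` is a name I made up. It also takes care of the "bonus" case, because rearranged arrays become equal sets and `uniq` keeps just one of them.

```ruby
require 'set'

# Hypothetical helper: treats each array as a set, drops duplicates
# (including rearrangements), then drops every set that has a proper
# subset elsewhere in the collection.
def remove_super_sets(arr)
  sets = arr.map(&:to_set).uniq
  sets.reject { |a| sets.any? { |b| b.proper_subset?(a) } }.map(&:to_a)
end

p remove_super_sets([[1], [2, 1, 3], [2, 4], [1, 4]])
#=> [[1], [2, 4]]
p remove_super_sets([[1, 2], [2, 1], [3]])
#=> [[1, 2], [3]]
```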
This was a nice exercise for me. I have used the logic from here.
My code iterates over each subarray (except the first) and subtracts it from the first subarray; when the result is empty, that subarray contains every element of the first one and is left out.
def remove_super_arrays(arr)
arr.each_with_index.with_object([]) do |(sub_array, index), result|
next if index == 0
result << sub_array unless (arr.first - sub_array).empty?
end.unshift(arr.first)
end
arr = [[1,6],[1,2],[2,4],[3,5],[1,3,6],[2,3,6]]
p remove_super_arrays(arr)
#=> [[1, 6], [1, 2], [2, 4], [3, 5], [2, 3, 6]]

Why can't I use |a,b| instead of |(a,b)| in arr.map { |(a, b)| !b.nil? ? a + b : a }?

In the code below, arr is meant to be a two-dimensional array, such as [[1,2],[4,5]]. It computes the sum of the elements of the sub arrays. A subarray can have only one element, in which case the sum is just that one element.
def compute(arr)
return nil unless arr
arr.map { |(a, b)| !b.nil? ? a + b : a }
end
Why does the code have to be |(a, b)| instead of |a,b|?
What does (a,b) mean in Ruby?
You could use |a,b| too; here it behaves no differently from |(a,b)|.
You may also rewrite the code as below, which doesn't have the element number limit for the sub arrays:
arr.map { |a| a.inject{ |sum,x| sum + x } }
or even:
arr.map { |a| a.inject(:+) }
Both are equivalent if arr is an array:
arr = [[1, 2], [4, 5]]
arr.map { |a, b| [a, b] } #=> [[1, 2], [4, 5]]
arr.map { |(a, b)| [a, b] } #=> [[1, 2], [4, 5]]
This is because the block is called with a single argument at a time: the subarray. Something like:
yield [1, 2]
yield [4, 5]
This changes if more than one argument is yielded. each_with_index, for example, calls the block with two arguments: the item (i.e. the subarray) and its index. Something like:
yield [1, 2], 0
yield [4, 5], 1
The difference is obvious:
enum = [[1, 2], [4, 5]].each_with_index
enum.map { |a, b| [a, b] } #=> [[[1, 2], 0], [[4, 5], 1]]
enum.map { |(a, b)| [a, b] } #=> [[1, 2], [4, 5]]
Note that omitting the parentheses also allows you to set a default argument value:
arr = [[1, 2], [4]]
arr.map { |a, b = 0| a + b } #=> [3, 4]
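To tie this back to the question, here is a small sketch (variable names are mine) showing where the parentheses become necessary: as soon as the block receives a second argument, they are what picks apart the subarray.

```ruby
pairs = [[1, 2], [4, 5]]

# One yielded value (the subarray): both forms destructure it.
plain  = pairs.map { |a, b| a + b }
parens = pairs.map { |(a, b)| a + b }
p plain   #=> [3, 9]
p parens  #=> [3, 9]

# Two yielded values (subarray, index): the parentheses destructure
# the first argument, and i receives the index.
with_index = pairs.each_with_index.map { |(a, b), i| (a + b) * i }
p with_index  #=> [0, 9]
```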

How do I find the location of an integer in an array of arrays in ruby?

Given:
a = [[1,2,3,4],
[1,2,3,7],
[1,2,3,4]]
What do I need to do to output the location of the 7 as (1,3)?
I've tried using .index to no avail.
require 'matrix'
a = [[1, 2, 3, 4],
[1, 2, 3, 7],
[1, 2, 3, 4]]
Matrix[*a].index(7)
=> [1, 3]
If your sub-arrays are all the same width, you can flatten it into a single array and think of the position as row_num * row_width + col_num:
idx = a.flatten.index(7)
row_width = a[0].length
row = idx / row_width
col = idx - (row * row_width)
p [row, col] # => [1, 3]
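The same arithmetic can be written with Integer#divmod, which returns the quotient and remainder in one call. A minimal sketch, again assuming all rows have the same width:

```ruby
a = [[1, 2, 3, 4],
     [1, 2, 3, 7],
     [1, 2, 3, 4]]

idx = a.flatten.index(7)                    # nil when the value is absent
row, col = idx.divmod(a[0].length) if idx   # quotient = row, remainder = column
p [row, col]  #=> [1, 3]
```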
Or you could just iterate it to find all matches:
def find_indices_for(array, value)
array.each_with_object([]).with_index do |(row, matches), row_index|
matches << [row_index, row.index(value)] if row.index(value)
end
end
find_indices_for(a, 7) # => [[1, 3]]
find_indices_for(a, 2) # => [[0, 1], [1, 1], [2, 1]]
each_with_index works pretty well here:
def locator(array, number)
locations = Array.new
array.each_with_index do |mini_array, index|
mini_array.each_with_index do |element, sub_index|
locations << [index, sub_index] if element == number
end
end
locations
end
Now, locator(array, number) will return an array containing all the locations of number in array.
def locs2D(a,e)
a.size.times.with_object([]) do |row,arr|
a[row].size.times { |col| arr << [row,col] if a[row][col] == e }
end
end
locs2D(a,7) #=> [[1, 3]]
locs2D(a,3) #=> [[0, 2], [1, 2], [2, 2]]

How to search within a two-dimensional array

I'm trying to learn how to search within a two-dimensional array; for example:
array = [[1,1], [1,2], [1,3], [2,1], [2,4], [2,5]]
I want to know how to search within the array for the arrays that are of the form [1, y] and then show what the other y numbers are: [1, 2, 3].
If anyone can help me understand how to search only with numbers (as a lot of the examples I found include strings or hashes) and even where to look for the right resources even, that would be helpful.
Ruby allows you to look into an element by using parentheses in the block argument. select and map only assign a single block argument, but you can look into the element:
array.select{|(x, y)| x == 1}
# => [[1, 1], [1, 2], [1, 3]]
array.select{|(x, y)| x == 1}.map{|(x, y)| y}
# => [1, 2, 3]
You can omit the parentheses that correspond to the entire expression between |...|:
array.select{|x, y| x == 1}
# => [[1, 1], [1, 2], [1, 3]]
array.select{|x, y| x == 1}.map{|x, y| y}
# => [1, 2, 3]
As a coding style, it is a custom to mark unused variables as _:
array.select{|x, _| x == 1}
# => [[1, 1], [1, 2], [1, 3]]
array.select{|x, _| x == 1}.map{|_, y| y}
# => [1, 2, 3]
You can use Array#select and Array#map methods:
array = [[1,1], [1,2], [1,3], [2,1], [2,4], [2,5]]
#=> [[1, 1], [1, 2], [1, 3], [2, 1], [2, 4], [2, 5]]
array.select { |el| el[0] == 1 }
#=> [[1, 1], [1, 2], [1, 3]]
array.select { |el| el[0] == 1 }.map {|el| el[1] }
#=> [1, 2, 3]
For more methods on arrays explore docs.
If you first select and then map, you can use the grep method to do it all in one call:
p array.grep ->x{x[0]==1}, &:last #=> [1,2,3]
Another way of doing the same thing is to use Array#map together with Array#compact. This has the benefit of only requiring one block and a trivial operation, which makes it a bit easier to comprehend.
array.map { |a, b| b if a == 1 }
#=> [1, 2, 3, nil, nil, nil]
array.map { |a, b| b if a == 1 }.compact
#=> [1, 2, 3]
You can use each_with_object:
array.each_with_object([]) { |(x, y), a| a << y if x == 1 }
#=> [1, 2, 3]
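Two more variations, offered as a sketch rather than the one right way: Enumerable#filter_map (available from Ruby 2.7, so this assumes a reasonably recent Ruby) combines the select and map steps, and group_by collects the y values for every x at once.

```ruby
array = [[1, 1], [1, 2], [1, 3], [2, 1], [2, 4], [2, 5]]

# filter_map keeps only the truthy block results (Ruby 2.7+).
ys = array.filter_map { |x, y| y if x == 1 }
p ys  #=> [1, 2, 3]

# Group by the first number, then keep only the second numbers.
by_first = array.group_by(&:first).transform_values { |pairs| pairs.map(&:last) }
p by_first  #=> {1=>[1, 2, 3], 2=>[1, 4, 5]}
```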

Is there any way to create 2 x 2 array/matrix from a larger array/matrix

New to Ruby and learning at the moment.
I am not sure I should use arrays or matrix for this.
I have arrays
[['J','O','I','J','O'],
['I','J','O','J','O'],
['I','I','J','I','J']]
I want to find out the following as you can see in the image.
['J', 'O']
['I', 'J']
I thought of using Ruby's Matrix, but I am not sure whether I can divide the original array/matrix into small 2 by 2 matrices and check whether each one matches [J, O], [I, J].
Or should I use array and loop through.
I appreciate any inputs.
I suggest you use the method Matrix#minor to do this.
Code
require 'matrix'
def find_in_matrix(arr,sub)
sub_nrows = sub.size
sub_ncols = sub.first.size
rows = Array[*0..arr.size-sub_nrows]
cols = Array[*0..arr.first.size-sub_ncols]
arr_m = Matrix[*arr]
sub_m = Matrix[*sub]
rows.product(cols).select {|i,j| arr_m.minor(i,sub_nrows,j,sub_ncols)==sub_m}
end
Example
arr = [['J','O','I','J','O'],
['I','J','O','J','O'],
['I','I','J','I','J']]
sub = [['J', 'O'],
['I', 'J']]
find_in_matrix(arr,sub) #=> [[0, 0], [1, 1], [1, 3]]
find_in_matrix(arr, [['O'], ['J']]) #=> [[0, 1], [1, 2], [1, 4]]
find_in_matrix(arr, [['O']]) #=> [[0, 1], [0, 4], [1, 2], [1, 4]]
find_in_matrix(arr, [['I','J','O']]) #=> [[0, 2], [1, 0]]
find_in_matrix(arr, [['I','J'],['J','O']]) #=> []
find_in_matrix(arr, [[]]) #=> [[0, 0], [0, 1],...,[0, 5]]
# [1, 0], [1, 1],...,[1, 5]]
# [2, 0], [2, 1],...,[2, 5]]
Explanation
For the example above:
sub_nrows = sub.size #=> 2
sub_ncols = sub.first.size #=> 2
rows = Array[*0..(arr.size-sub_nrows)] #=> [0, 1]
cols = Array[*0..(arr.first.size-sub_ncols)] #=> [0, 1, 2, 3]
arr_m = Matrix[*arr]
#=> Matrix[["J", "O", "I", "J", "O"], ["I", "J", "O", "J", "O"],
# ["I", "I", "J", "I", "J"]]
sub_m = Matrix[*sub]
#=> Matrix[["J", "O"], ["I", "J"]]
a = rows.product(cols)
#=> [[0, 0], [0, 1], [0, 2], [0, 3], [1, 0], [1, 1], [1, 2], [1, 3]]
a.select {|i,j| arr_m.minor(i,sub_nrows,j,sub_ncols)==sub_m}
#=> [[0, 0], [1, 1], [1, 3]]
Consider the first element of a that select passes into the block: [0, 0] (i.e., the block variables i and j are both assigned the value zero). We therefore compute:
arr_m.minor(i,sub_nrows,j,sub_ncols) #=> arr_m.minor(0,2,0,2)
#=> Matrix[["J", "O"], ["I", "J"]]
As
arr_m.minor(0,2,0,2) == sub_m
[0, 0] is selected. On the other hand, for the element [1, 2] of a, i => 1, j => 2, so:
arr_m.minor(i,sub_nrows,j,sub_ncols) #=> arr_m.minor(1,2,2,2)
#=> Matrix[["O", "J"], ["J", "I"]]
which does not equal sub_m, so the element [1, 2] is not selected.
Note that Matrix#minor has two forms. I used the form that takes four parameters. The other form takes two ranges as parameters.
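As a quick illustration of the two forms, the sketch below extracts the same 2 by 2 block (the match at [1, 3]) both ways. The variable names `four_args` and `two_ranges` are just for this example.

```ruby
require 'matrix'

arr_m = Matrix[%w[J O I J O], %w[I J O J O], %w[I I J I J]]
sub_m = Matrix[%w[J O], %w[I J]]

four_args  = arr_m.minor(1, 2, 3, 2)   # start_row, nrows, start_col, ncols
two_ranges = arr_m.minor(1..2, 3..4)   # row range, column range

p four_args == two_ranges  #=> true
p four_args == sub_m       #=> true
```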
This does not answer the question directly, and it is not an efficient way to get the matched positions, but it works:
matrix = [
%w(J O I J O),
%w(I J O J O),
%w(I I J I J)
]
target = [
%w(J O),
%w(I J)
]
matrix.each_cons(target.length).each_with_index do |sub, row|
sub.map{|a| a.each_cons(target[0].length).to_a}.tap do |sub|
head = sub.shift
head.zip(*sub).each_with_index do |m, col|
if m == target
puts "#{row}, #{col}"
end
end
end
end
You can define the following:
def find_in_matrix(matrix, target)
(0..matrix.length-target.length).to_a.product(
(0..matrix.first.length-target.first.length).to_a).select do |x, y|
(0...target.length).to_a.product(
(0...target.first.length).to_a).all? do |test_x, test_y|
matrix[x+test_x][y+test_y] == target[test_x][test_y]
end
end
end
matrix = [["J", "O", "I", "J", "O"],
["I", "J", "O", "J", "O"],
["I", "I", "J", "I", "J"]]
target = [["J", "O"],
["I", "J"]]
find_in_matrix(matrix, target)
=> [[0, 0], [1, 1], [1, 3]]
This solution simply goes over all the sub-matrices of matrix with target's size, and selects the ones that are equal to it.
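The same idea can also be written by slicing out each candidate block and comparing it with the target as a whole. This is only a sketch, and `find_block` is a hypothetical name; it assumes every row of matrix has the same length.

```ruby
# Hypothetical variant: slice an h-by-w block at every position and
# compare the block with the target in one go.
def find_block(matrix, target)
  h = target.length
  w = target.first.length
  (0..(matrix.length - h)).flat_map do |r|
    (0..(matrix.first.length - w))
      .select { |c| matrix[r, h].map { |row| row[c, w] } == target }
      .map { |c| [r, c] }
  end
end

matrix = [%w[J O I J O],
          %w[I J O J O],
          %w[I I J I J]]
target = [%w[J O],
          %w[I J]]

p find_block(matrix, target)
#=> [[0, 0], [1, 1], [1, 3]]
```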
