Without getting too much into biology, proteins are made of amino acids. Each of the 20 amino acids that make up proteins is represented by a character in a sequence. Each amino acid character has a different chemical formula, which I represent as a string. For example, "M" has the formula "C5H11NO2S".
Given the 20 different formulas (and the varying frequency of each amino acid character in a protein sequence), I want to combine all 20 of them into a single formula that gives the total formula for the protein.
So first: multiply each formula by the frequency of its char in the sequence
Second : sum together all multiplied formulas into one formula.
To accomplish this, I first tried multiplying each amino acid char frequency in the sequence by the numbers in the chemical formula. I did this using .tally
sequence ="MGAAARTLRLALGLLLLATLLRPADACSCSPVHPQQAFCNADVVIRAKAVSEKEVDSGNDIYGNPIKRIQYEIKQIKMFKGPEKDIEFI"
sequence.chars.tally #=> {"M"=>2, "G"=>5, "A"=>11, "R"=>5, "T"=>2, "L"=>9, "P"=>5, "D"=>5, "C"=>3, "S"=>4, "V"=>5, "H"=>1, "Q"=>4, "F"=>3, "N"=>3, "I"=>8, "K"=>7, "E"=>5, "Y"=>2}
Then, I listed all the amino acids chars and formulas into a hash
hash_of_formulas = {"A"=>"C3H7NO2", "R"=>"C6H14N4O2", "N"=>"C4H8N2O3", "D"=>"C4H7NO4", "C"=>"C3H7NO2S", "E"=>"C5H9NO4", "Q"=>"C5H10N2O3", "G"=>"C2H5NO2", "H"=>"C6H9N3O2", "I"=>"C6H13NO2", "L"=>"C6H13NO2", "K"=>"C6H14N2O2", "M"=>"C5H11NO2S", "F"=>"C9H11NO2", "P"=>"C5H9NO2", "S"=>"C3H7NO3", "T"=>"C4H9NO3", "W"=>"C11H12N2O2", "Y"=>"C9H11NO3", "V"=>"C5H11NO2"}
An example of what the process for my overall goal is:
In the sequence, "M" occurs twice, so "C5H11NO2S" becomes "C10H22N2O4S2". "C", which has the formula "C3H7NO2S", occurs 3 times in the sequence, so "C3H7NO2S" becomes "C9H21N3O6S3".
So summing "C10H22N2O4S2" and "C9H21N3O6S3" yields "C19H43N5O10S5".
How can I repeat the process of multiplying each formula by its frequency and then summing together all multiplied formulas?
I know that I could use regex for multiplying a formula by its frequency for an individual string using
formula_multiplied_by_frequency = "C5H11NO2S".gsub(/\d+/) { |x| x.to_i * 4}
But I'm not sure of any methods to use regex on strings embedded within hashes
If I understand correctly, you want to produce the total formula for a given protein sequence. Here's how I'd do it:
NUCLEOTIDES = {"A"=>"C3H7NO2", "R"=>"C6H14N4O2", "N"=>"C4H8N2O3", "D"=>"C4H7NO4", "C"=>"C3H7NO2S", "E"=>"C5H9NO4", "Q"=>"C5H10N2O3", "G"=>"C2H5NO2", "H"=>"C6H9N3O2", "I"=>"C6H13NO2", "L"=>"C6H13NO2", "K"=>"C6H14N2O2", "M"=>"C5H11NO2S", "F"=>"C9H11NO2", "P"=>"C5H9NO2", "S"=>"C3H7NO3", "T"=>"C4H9NO3", "W"=>"C11H12N2O2", "Y"=>"C9H11NO3", "V"=>"C5H11NO2"}
NUCLEOTIDE_COMPOSITIONS = NUCLEOTIDES.each_with_object({}) { |(nucleotide, formula), compositions|
compositions[nucleotide] = formula.scan(/([A-Z][a-z]*)(\d*)/).map { |element, count| [element, count.empty? ? 1 : count.to_i] }.to_h
}
def formula(sequence)
sequence.each_char.with_object(Hash.new(0)) { |nucleotide, final_counts|
NUCLEOTIDE_COMPOSITIONS[nucleotide].each { |element, element_count|
final_counts[element] += element_count
}
}.map { |element, element_count|
"#{element}#{element_count == 1 ? "" : element_count}"
}.join
end
sequence = "MGAAARTLRLALGLLLLATLLRPADACSCSPVHPQQAFCNADVVIRAKAVSEKEVDSGNDIYGNPIKRIQYEIKQIKMFKGPEKDIEFI"
p formula(sequence)
# => "C434H888N120O213S5"
You can't use regexp to multiply things. You can use it to parse a formula, but then it's on you and regular Ruby to do the math. The first job is to prepare a composition lookup by breaking down each nucleotide formula. Once we have a composition hash for each nucleotide, we can iterate over a nucleotide sequence, and add up all the elements of each nucleotide.
BTW, tally is not particularly useful here: tally has to iterate over the sequence, and then you have to iterate over the tally anyway, and there is no aggregate operation going on that can't be done by going over each letter independently.
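For comparison, here is a minimal sketch of the tally-based route the question started with. The names `SAMPLE_FORMULAS`, `parse_formula` and `formula_via_tally` are hypothetical, and the lookup holds only two entries copied from the question's hash. It is only meant to show that tallying first and multiplying afterwards gives the same element totals as walking the sequence letter by letter.

```ruby
# Hypothetical two-entry lookup, copied from the hash in the question.
SAMPLE_FORMULAS = { "M" => "C5H11NO2S", "C" => "C3H7NO2S" }.freeze

# Parse "C5H11NO2S" into {"C"=>5, "H"=>11, "N"=>1, "O"=>2, "S"=>1}.
def parse_formula(formula)
  formula.scan(/([A-Z][a-z]*)(\d*)/).to_h do |element, count|
    [element, count.empty? ? 1 : count.to_i]
  end
end

# Tally the letters first, then multiply each composition by its frequency.
def formula_via_tally(sequence, formulas = SAMPLE_FORMULAS)
  totals = Hash.new(0)
  sequence.each_char.tally.each do |letter, frequency|
    parse_formula(formulas.fetch(letter)).each do |element, count|
      totals[element] += count * frequency
    end
  end
  totals
end

p formula_via_tally("MMCCC")
# => {"C"=>19, "H"=>43, "N"=>5, "O"=>10, "S"=>5}
```

The result matches the "C19H43N5O10S5" worked out by hand in the question for two "M" and three "C". This sketch needs Ruby 2.7 or later for `tally`.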
EDIT: I probably made the regexp slightly more complicated than it needs to be, but it should parse stuff like CuSO4 correctly. I don't know if it's an accident or not that all nucleotides are only composed of elements with a single-character symbol... :P
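A quick sketch of what that pattern does with a two-letter symbol. `ELEMENT_PATTERN` and `composition` are names made up for this illustration. Unlike the lookup built above, this helper adds counts up rather than assigning them, which is an assumption made here so that a symbol appearing twice in one formula would not be overwritten.

```ruby
# One uppercase letter plus optional lowercase letters is the element symbol;
# the trailing count is optional and is treated as 1 when absent.
ELEMENT_PATTERN = /([A-Z][a-z]*)(\d*)/

def composition(formula)
  formula.scan(ELEMENT_PATTERN).each_with_object(Hash.new(0)) do |(element, count), totals|
    totals[element] += count.empty? ? 1 : count.to_i
  end
end

p "CuSO4".scan(ELEMENT_PATTERN)  # => [["Cu", ""], ["S", ""], ["O", "4"]]
p composition("CuSO4")           # => {"Cu"=>1, "S"=>1, "O"=>4}
p composition("C5H11NO2S")       # => {"C"=>5, "H"=>11, "N"=>1, "O"=>2, "S"=>1}
```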
Givens
We are given a string representing a protein comprised of amino acids:
sequence = "MGAAARTLRLALGLLLLATLLRPADACSCSPVHPQQAFCNADVVIR" +
"AKAVSEKEVDSGNDIYGNPIKRIQYEIKQIKMFKGPEKDIEFI"
and a hash that contains the formulas of amino acids:
formulas = {
"A"=>"C3H7NO2", "R"=>"C6H14N4O2", "N"=>"C4H8N2O3", "D"=>"C4H7NO4",
"C"=>"C3H7NO2S", "E"=>"C5H9NO4", "Q"=>"C5H10N2O3", "G"=>"C2H5NO2",
"H"=>"C6H9N3O2", "I"=>"C6H13NO2", "L"=>"C6H13NO2", "K"=>"C6H14N2O2",
"M"=>"C5H11NO2S", "F"=>"C9H11NO2", "P"=>"C5H9NO2", "S"=>"C3H7NO3",
"T"=>"C4H9NO3", "W"=>"C11H12N2O2", "Y"=>"C9H11NO3", "V"=>"C5H11NO2"
}
Obtain counts of atoms in each amino acid
As a first step we can calculate the numbers of each atom in each amino acid:
counts = formulas.transform_values do |s|
s.scan(/[CHNOS]\d*/).
each_with_object({}) do |s,h|
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
end
end
#=> {"A"=>{"C"=>3, "H"=>7, "N"=>1, "O"=>2},
# "R"=>{"C"=>6, "H"=>14, "N"=>4, "O"=>2},
# ...
# "M"=>{"C"=>5, "H"=>11, "N"=>1, "O"=>2, "S"=>1}
# ...
# "V"=>{"C"=>5, "H"=>11, "N"=>1, "O"=>2}}
Compute formula for protein
Then it's simply:
def protein_formula(sequence, counts)
sequence.each_char.
with_object("C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0) do |c,h|
counts[c].each { |aa,cnt| h[aa] += cnt }
end.each_with_object('') { |(aa,nbr),s| s << "#{aa}#{nbr}" }
end
protein_formula(sequence, counts)
#=> "C434H888N120O213S5"
Another example:
protein_formula("MCMPCFTTDHQMARKCDDCCGGKGRGKCYGPQCLCR", counts)
#=> "C158H326N52O83S11"
Explanation of calculation of counts
This calculation:
counts = formulas.transform_values do |s|
s.scan(/[CHNOS]\d*/).each_with_object({}) do |s,h|
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
end
end
uses the method Hash#transform_values. It returns a hash having the same keys as the hash formulas, with the values of those keys in formulas modified by transform_values's block. For example, formulas["A"] ("C3H7NO2") is "transformed" to the hash {"C"=>3, "H"=>7, "N"=>1, "O"=>2} in the hash that is returned, counts.
transform_values passes each value of formulas to the block and sets the block variable equal to it. The first value passed is "C3H7NO2", so it sets:
s = "C3H7NO2"
We can write the block calculation more simply:
h = {}
s.scan(/[CHNOS]\d*/).each do |s|
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
end
h
(Once you understand this calculation, which I explain below, see Enumerable#each_with_object to understand why I used that method in my solution.)
After initializing h to an empty hash, the following calculations are performed:
h = {}
a = s.scan(/[CHNOS]\d*/)
#=> ["C3", "H7", "N", "O2"]
a is computed using String#scan with the regular expression /[CHNOS]\d*/. That regular expression, or regex, matches exactly one character in the character class [CHNOS] followed by zero or more (*) digits (\d). It therefore separates the string "C3H7NO2" into the substrings that are returned in the array shown under the calculation of a above. Continuing,
a.each do |s|
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
end
changes h to the following:
h #=> {"C"=>3, "H"=>7, "N"=>1, "O"=>2}
The block variable s is initially set equal to the first element of a that is passed to each's block:
s = "C3"
then we compute:
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
h["C"] = 2 == 1 ? 1 : "3".to_i
= false ? 1 : 3
= 3
This is repeated for each element of a.
Explanation of construction of formula for the protein
We can simplify the following code (see note 1):
sequence.each_char.with_object("C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0) do |c,h|
counts[c].each { |aa,cnt| h[aa] += cnt }
end.each_with_object('') { |(aa,nbr),s| s << "#{aa}#{nbr}" }
to more or less the following:
h = { "C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0 }
ch = sequence.chars
#=> ["M", "G", "A",..., "F", "I"]
ch.each do |c|
counts[c].each { |aa,cnt| h[aa] += cnt }
end
h #=> {"C"=>434, "H"=>888, "N"=>120, "O"=>213, "S"=>5}
When the first value of ch ("M") is passed to each's block (when h = { "C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0 }), the following calculations are performed:
c = "M"
g = counts[c]
#=> {"C"=>5, "H"=>11, "N"=>1, "O"=>2, "S"=>1}
g.each { |aa,cnt| h[aa] += cnt }
h #=> {"C"=>5, "H"=>11, "N"=>1, "O"=>2, "S"=>1}
Lastly, (when h #=> {"C"=>434, "H"=>888, "N"=>120, "O"=>213, "S"=>5})
s = ''
h.each { |aa,nbr| s << "#{aa}#{nbr}" }
s #=> "C434H888N120O213S5"
When aa = "C" and nbr = 434,
"#{aa}#{nbr}"
#=> "C434"
is appended to the string s.
1. ("C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0) is shorthand for ({"C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0}).
I am new to Ruby and was practicing some code. I want to count the letters in a string with self-written code, without using the #length or #size method. I have searched online but was unable to find anything relating to my query. I would appreciate it if anyone could help me out with this simple program.
Another option: map String#chars with index, then pick the last:
str = "123456"
str.chars.map.with_index { |_, i| i + 1 }.last
#=> 6
It generates an Array, but we are not looking for efficiency here.
Or even using String#index with offset:
str = "aaaa"
str.index(str[-1], -1) + 1
#=> 4
It looks for the index of the last character, starting the search from the end.
You can do that using any String method that enumerates characters. The most obvious is String#each_char, as @knut mentioned in a comment.
def str_length(str)
enum = str.each_char
n = 0
loop do
enum.next
n += 1
end
n
end
str_length "Zaphod"
#=> 6
Let's see what is happening here.
str = "Zaphod"
enum = str.each_char
#=> #<Enumerator: "Zaphod":each_char>
n = 0
loop do
s = enum.next
n += 1
puts "s = #{s}, n = #{n}"
end
n #=> 6
prints
s = Z, n = 1
s = a, n = 2
s = p, n = 3
s = h, n = 4
s = o, n = 5
s = d, n = 6
See Enumerator#next. After enum.next #=> "d" is executed enum.next is executed once more, raising a StopIteration exception. That exception is handled by Kernel#loop by breaking out of the loop.
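To see that behaviour in isolation, here is a small sketch. `count_with_loop` is a hypothetical name that just repackages the method above so it accepts any enumerator.

```ruby
enum = "ab".each_char
enum.next  # => "a"
enum.next  # => "b"

# A third call has nothing left to return, so it raises StopIteration.
begin
  enum.next
rescue StopIteration
  puts "StopIteration raised"
end

# Kernel#loop rescues StopIteration itself, so the loop simply ends.
def count_with_loop(enum)
  n = 0
  loop do
    enum.next
    n += 1
  end
  n
end

p count_with_loop("ab".each_char)  # => 2
p count_with_loop("".each_char)    # => 0
```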
As I said at the outset, any String method could be used that enumerates characters. For example, enum = str.gsub(/./).
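A short sketch of the gsub variant, with one caveat worth knowing: without the m flag, . does not match a newline, so newlines would go uncounted. `count_matches` is a hypothetical helper that counts with a plain each instead of next.

```ruby
# Count whatever the enumerator yields, without calling length or size.
def count_matches(enum)
  n = 0
  enum.each { n += 1 }
  n
end

p count_matches("Zaphod".gsub(/./))   # => 6
p count_matches("a\nb".gsub(/./))     # => 2 (the newline is skipped)
p count_matches("a\nb".gsub(/./m))    # => 3 (with /m, . also matches "\n")
p count_matches("a\nb".each_char)     # => 3
```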
The same approach could be used for any class that implements a method that enumerates elements of a collection. For example, we could add a method to the Enumerable module, which would then be available for every class that includes that module.
module Enumerable
def my_length
enum = each
n = 0
loop do
enum.next
n += 1
end
n
end
end
[1,2,3,4].my_length
#=> 4
{ a: 1, b: 2 }.my_length
#=> 2
(1..5).my_length
#=> 5
I have two huge arrays of sentences, one in German and one in English. I will search through the German sentences for sentences that contain a certain word and if they do, I will check if there is an equivalent English sentence (using a hash with connection information). However, if the user is looking for a very common word, I don't want to return every single sentence that contains it but only the first x matches and stop searching then.
If I do german_sentences.index { |sentence| sentence.include?(word) } I get only one match at a time.
If I use german_sentences.keep_if { |sentence| sentence.include?(word) } I get all matches, but also lose the index information, which is really critical for this.
I am now using a custom loop with each_with_index and break once the maximum has been reached, but I really feel that I must be missing some existing solution, at least something that gives a limited number of matches (even if not their indices)...
german_sentences
.each_index
.lazy
.select{|i| german_sentences[i].include?(word)}
.first(n)
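A runnable sketch of the chain above on made-up sample data. The method names `first_match_indices` and `examined_before_stop` are hypothetical, and the second one exists only to show that the lazy chain stops scanning once enough matches have been found.

```ruby
# Return the indices of the first `limit` sentences containing `word`.
def first_match_indices(sentences, word, limit)
  sentences
    .each_index
    .lazy
    .select { |i| sentences[i].include?(word) }
    .first(limit)
end

# Same chain, but report how many sentences were inspected before stopping.
def examined_before_stop(sentences, word, limit)
  examined = 0
  sentences.each_index.lazy.select { |i|
    examined += 1
    sentences[i].include?(word)
  }.first(limit)
  examined
end

german = ["Der Hund bellt", "Die Katze", "Ein Hund", "Kein Tier", "Mein Hund"]
indices = first_match_indices(german, "Hund", 2)
p indices                                  # => [0, 2]
p indices.map { |i| german[i] }            # => ["Der Hund bellt", "Ein Hund"]
p examined_before_stop(german, "Hund", 2)  # => 3
```

Because the result is a list of indices, it can be used directly to look up the matching English sentence.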
If your need is not a one-off, you could use Module#refine rather than monkeypatching Array. refine was added in v2.0 experimentally, then changed considerably in v2.1. One of the restrictions on the use of refine is: "You may only activate refinements at top-level...", which evidently prevents testing in Pry and IRB.
module M
refine Array do
def select_indices_first(n)
i = 0
k = 0
a = []
return a if n == 0
each { |x| (a << i; k += 1) if yield(x); break if k == n; i += 1 }
a
end
def select_first(n) # if you wanted this also...
k = 0
a = []
return a if n == 0
each { |x| (a << x; k += 1) if yield(x); break if k == n }
a
end
end
end
using M
sentences = ["How now brown", "Cat", "How to guide", "How to shop"]
sentences.select_indices_first(0) {|s| s.include?("How")} # => []
sentences.select_indices_first(1) {|s| s.include?("How")} # => [0]
sentences.select_indices_first(2) {|s| s.include?("How")} # => [0, 2]
sentences.select_indices_first(3) {|s| s.include?("How")} # => [0, 2, 3]
sentences.select_indices_first(99) {|s| s.include?("How")} # => [0, 2, 3]
sentences.select_first(2) {|s| s.include?("How")}
# => ["How now brown", "How to guide"]
I don't understand why, but I'm getting too many inserts and matches generated when I nest these two loops. Any help appreciated!
pseudocode
two arrays - nested for loops
search 2nd array for match of each element in 1st array
if there is a match in 2nd array, take the number after the match
insert number in 1st array after word that has been matched
end
problem code:
ary1 = ['a','b','c','d']
ary2 = ['e','f','g', 'a']
limit = ary1.count - 1
limit2 = ary2.count - 1
(0..limit).each do |i|
(0..limit2).each do |j|
if ary1[i] == ary2[j]
ary1.insert(i,ary2[j])
puts 'match!'
end
end
end
puts ary1
output:
match!
match!
match!
match!
a
a
a
a
a
b
c
d
provisional solution:
ary1 = ['a','b','c','d']
ary2 = ['e','f','g', 'a']
# have to make a copy to avoid excessive matches
ary_dup = Array.new(ary1)
limit = ary1.count - 1
limit2 = ary2.count - 1
(0..limit).each do |i|
(0..limit2).each do |j|
if ary1[i] == ary2[j]
ary_dup.insert(i,ary2[j])
puts 'match!'
end
end
end
puts ary_dup
output:
match!
a
a
b
c
d
It's happening because you're modifying the array under examination (ary1) on the fly.
You could achieve desired result using this line of code -
(ary1 & ary2).each {|e| ary1.insert(ary1.index(e)+1,e)}
What it does is -
ary1 & ary2 returns an array which is intersection of two arrays - ary1 and ary2. In other words it'll contain all those elements that exist in both arrays.
.each and ensuing block traverses over this new array and inserts each element in ary1 at "index of original element" + 1
puts ary1 #=> ["a", "a", "b", "c", "d"]
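If you would rather not mutate ary1 at all, a different technique (not the one-liner above) is to build a new array with flat_map. This is a sketch under the assumption that every element of the first array that also occurs in the second should simply appear twice. `duplicate_matches` is a hypothetical name.

```ruby
# Build a new array: matched elements appear twice, the rest once.
def duplicate_matches(source, other)
  source.flat_map { |e| other.include?(e) ? [e, e] : [e] }
end

ary1 = %w[a b c d]
ary2 = %w[e f g a]
p duplicate_matches(ary1, ary2)  # => ["a", "a", "b", "c", "d"]
p ary1                           # => ["a", "b", "c", "d"] (unchanged)
```

Note that the two approaches differ when ary1 itself contains repeated elements: flat_map doubles every occurrence, while the index-based version inserts once per distinct match.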
The part below is not correct:
(0..limit).each do |i|
(0..limit2).each do |j|
if ary1[i] == ary2[j]
ary1.insert(i,ary2[j])
puts 'match!'
end
end
end
First pass:
ary1 = ['a','b','c','d']
ary2 = ['e','f','g', 'a']
When i = 0 and j = 3, there is a match. The line ary1.insert(0, ary2[j]) turns ary1 into ['a','a','b','c','d'].
Second pass:
ary1 = ['a','a','b','c','d']
ary2 = ['e','f','g', 'a']
When i = 1 and j = 3, there is a match again. The line ary1.insert(1, ary2[j]) turns ary1 into ['a','a','a','b','c','d'].
And so on. Since the outer loop runs 4 times (the original size of ary1), four 'a's are added to ary1, which finally becomes ['a','a','a','a','a','b','c','d'].
Array#insert says:
Inserts the given values before the element with the given index. Negative indices count backwards from the end of the array, where -1 is the last element.
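A small sketch of those rules in action. `insert_copy` is a hypothetical helper that works on a copy so each call starts from the same array.

```ruby
# Insert into a copy so the original array stays untouched.
def insert_copy(array, index, value)
  array.dup.insert(index, value)
end

letters = %w[a b c d]
p insert_copy(letters, 0, "x")   # => ["x", "a", "b", "c", "d"]
p insert_copy(letters, 2, "x")   # => ["a", "b", "x", "c", "d"]
p insert_copy(letters, -1, "x")  # => ["a", "b", "c", "d", "x"]
p insert_copy(letters, -2, "x")  # => ["a", "b", "c", "x", "d"]

# Why the nested loop keeps matching: inserting at index i pushes the
# original element to i + 1, so the next pass finds "a" again.
ary1 = %w[a b c d]
ary1.insert(0, "a")
p ary1[1]  # => "a"
```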