Pick item in array by percentage - ruby

I have an array which contains names and percentages.
Example:
[["JAMES", 3.318], ["JOHN", 3.271], ["ROBERT", 3.143]].
Now I have about a thousand of these names, and I'm trying to figure out how to choose a name randomly based on its percentage (James has 3.318% and John has 3.271%), so that each name has that percentage chance of being picked (Robert will have a 3.143% chance of being picked). Help would be appreciated.

You can use max_by: (the docs contain a similar example)
array.max_by { |_, weight| rand ** 1.fdiv(weight) }
This assumes that your weights are actual percentages, i.e. 3.1% has to be expressed as 0.031. Or, if you don't want to adjust your weights:
array.max_by { |_, weight| rand ** 100.fdiv(weight) }
I'm using fdiv here to account for possible integer values. If your weights are always floats, you can also use /.
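A minimal sketch of how the max_by approach could be wrapped and checked empirically. The method names weighted_pick and observed_shares, and the rng: parameter, are illustrative assumptions and not part of the answer above; the sample data is the array from the question.

```ruby
# Weighted pick: every item gets the key rand ** (1 / weight); the largest key wins.
# `rng` is an illustrative parameter so that runs can be made repeatable.
def weighted_pick(pairs, rng: Random.new)
  name, _weight = pairs.max_by { |_, weight| rng.rand ** 1.fdiv(weight) }
  name
end

# Draw many times and return each name's observed share of the draws.
def observed_shares(pairs, draws:, rng: Random.new)
  picks = Array.new(draws) { weighted_pick(pairs, rng: rng) }
  picks.tally.transform_values { |count| count.fdiv(draws) }
end

names = [["JAMES", 3.318], ["JOHN", 3.271], ["ROBERT", 3.143]]
p observed_shares(names, draws: 30_000, rng: Random.new(42))
```

With only these three names in the array, each observed share should land near that name's weight divided by the sum of the three weights.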

Even though I like @Stefan's answer more than mine, I will contribute a possible solution: I would distribute all my percentages along 100.0 so that they start at 0.0 and end at 100.0.
Imagine I have an array with the following percentages:
a = [10.5, 20.5, 17.8, 51.2]
where
a.sum = 100.0
We could write the following to distribute them along 100.0:
sum = 0.0
b = a.map { |el| sum += el }
and the result would be
b = [10.5, 31.0, 48.8, 100.0]
now I can generate a random number between 0.0 and 100.0:
r = rand(0.0..100.0) # or r = rand * 100.0
imagine r is 45.32.
I select the first element of b that is >= r:
idx = b.index { |el| el >= r }
which in our case would return 2.
Now you can select a[idx].
But I would go with @Stefan's answer as well :)
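The steps above can be folded into one method that works on name/percentage pairs directly. This is only a sketch: the name pick_by_cumulative and the optional r argument (handy for checking a known value) are assumptions made for illustration.

```ruby
# Walk the pairs, keeping a running total, and return the first name whose
# cumulative percentage reaches r. By default r is drawn from 0...(sum of percentages).
def pick_by_cumulative(pairs, r = rand * pairs.sum { |_, pct| pct })
  running = 0.0
  pairs.each do |name, pct|
    running += pct
    return name if r <= running
  end
  pairs.last.first # fallback in case rounding leaves the total just short of r
end

pairs = [["a", 10.5], ["b", 20.5], ["c", 17.8], ["d", 51.2]]
puts pick_by_cumulative(pairs, 45.32) # same r as in the walk-through above
puts pick_by_cumulative(pairs)        # a fresh random pick
```

Passing 45.32 selects the third pair, which matches the index 2 found in the walk-through.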

I assume you will be drawing multiple random values, in which case efficiency is important. Moreover, I assume that all names are unique and all percentages are positive (i.e., that pairs with percentages of 0.0 have been removed).
You are given what amounts to a (discrete) probability density function (PDF). The first step is to convert that to a cumulative distribution function (CDF).
Suppose we are given the following array (whose percentages sum to 100).
arr = [["LOIS", 28.16], ["JAMES", 22.11], ["JOHN", 32.71], ["ROBERT", 17.02]]
First, separate the names from the percentages.
names, probs = arr.transpose
#=> [["LOIS", "JAMES", "JOHN", "ROBERT"],
# [28.16, 22.11, 32.71, 17.02]]
Next compute the CDF.
cdf = probs.drop(1).
each_with_object([0.01 * probs.first]) { |pdf, cdf|
cdf << 0.01 * pdf + cdf.last }
#=> [0.2816, 0.5027, 0.8298, 1.0]
The idea is to generate a (pseudo-)random number r between zero and one and find the first value c of the CDF for which r <= c. [1] To do this efficiently we will perform a binary search of the CDF. This is possible because the CDF is an increasing function.
I will use Array#bsearch_index. This method is essentially the same as Array#bsearch (whose doc is the relevant one), except that it returns the index of the matching element of cdf rather than the element itself. It will shortly be evident why we want the index.
r = rand
#=> 0.6257547400776025
idx = cdf.bsearch_index { |c| r <= c }
#=> 2
Note that we cannot write cdf.bsearch_index { |c| rand <= c } as rand would be executed each time the block is evaluated.
The randomly-selected name is therefore [2]:
names[idx]
#=> "JOHN"
Now let's put all this together.
def setup(arr)
  @names, probs = arr.transpose
  @cdf = probs.drop(1).
    each_with_object([0.01 * probs.first]) { |pdf, cdf| cdf << 0.01 * pdf + cdf.last }
end
def random_name
  r = rand
  @names[@cdf.bsearch_index { |c| r <= c }]
end
Let's try it. Execute setup to compute the instance variables @names and @cdf.
setup(arr)
@names
#=> ["LOIS", "JAMES", "JOHN", "ROBERT"]
@cdf
#=> [0.2816, 0.5027, 0.8298, 1.0]
and then call random_name each time a random name is wanted.
5.times.map { random_name }
#=> ["JOHN", "LOIS", "JAMES", "LOIS", "JAMES"]
1. This is how most discrete random variates are generated in simulation models.
2. Had I used bsearch rather than bsearch_index I would have had to earlier create a hash with cdf=>name key-value pairs in order to retrieve a name for a given randomly-selected CDF value.
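As a sketch only, the same idea can be packaged in a small class so the CDF is built once and reused for every draw. The class name WeightedSampler and the rng: parameter are assumptions for illustration. Unlike the code above, this version divides by the sum of the weights instead of assuming they total 100, and pins the last CDF entry to 1.0 to absorb rounding error.

```ruby
# Builds the CDF once; each call to #sample is then a single binary search.
class WeightedSampler
  def initialize(pairs, rng: Random.new)
    @names, weights = pairs.transpose
    total = weights.sum.to_f
    running = 0.0
    @cdf = weights.map { |w| running += w / total }
    @cdf[-1] = 1.0 # guard: rounding must not leave the last bound below 1.0
    @rng = rng
  end

  def sample
    r = @rng.rand # drawn once, outside the search block
    @names[@cdf.bsearch_index { |c| r <= c }]
  end
end

arr = [["LOIS", 28.16], ["JAMES", 22.11], ["JOHN", 32.71], ["ROBERT", 17.02]]
sampler = WeightedSampler.new(arr)
p Array.new(5) { sampler.sample }
```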

This is my solution to the problem:
array = [["name1", 33],["name2", 20],["name3",10],["name4",7],["name5", 30]]
def random_name(array)
  random_number = rand(0.000..100.000)
  sum = 0
  array.each do |x|
    if random_number.between?(sum, sum + x[1])
      return x[0]
    else
      sum += x[1]
    end
  end
end
puts random_name(array)

Related

Generating a series of non-random numbers with a maximum sum of 100

I am working on a program that has a component that will generate simulated demographic numbers for various hypothetical jurisdictions.
The methods I have set up to generate each subset of demographics are dependent on some other variables, but generally, most jurisdictions are supposed to look something like:
White - 65%
Latino - 20%
African-American - 10%
Other - 5%
Of course, this isn't always the case. In some scenarios, white may be well under 50% with either Latino or AA being the most significant, but those are more edge cases. But in general that's usually about the balance.
So I am trying to figure out how to generate each demographic, which again is fed from different variables, mostly independently, but ensuring the number always adds up to 100.
I had thought about generating white first, since it's typically the largest, and then just creating a generator where Latino% = 100 - white%*.35 (.35 is just an example here), and so forth, but this creates a situation in which white would always be the plurality, which I don't necessarily want to happen.
So I am a bit stuck. I imagine this is as much a math problem as a Ruby problem. As a non-math person (who, as they have delved into programming, wishes they had paid better attention in class), I'd appreciate any guidance here.
Thank you!
First specify a cumulative distribution function (CDF).
DIST = { white: 0.65, latino: 0.85, aa: 0.95, other: 1.00 }
Note that
DIST[:white] - 0 #=> 0.65
DIST[:latino] - DIST[:white] #=> 0.20
DIST[:aa] - DIST[:latino] #=> 0.10
DIST[:other] - DIST[:aa] #=> 0.05
Now create a method to (pseudo-) randomly select one person from the population and return their ethnicity.
def select_one
  rn = rand
  DIST.find { |_k, v| rn <= v }.first
end
Try it.
10.times { p select_one }
:white
:aa
:latino
:white
:white
:white
:white
:white
:white
:latino
Now write a method to return a random sample of size n.
def draw_sample(n)
  n.times.with_object(Hash.new(0)) { |_, h| h[select_one] += 1 }
end
Try it.
10.times { p draw_sample(100) }
{:white=>66, :latino=>21, :aa=>9, :other=>4}
{:white=>72, :latino=>14, :aa=>11, :other=>3}
{:white=>61, :latino=>19, :aa=>14, :other=>6}
{:white=>64, :latino=>25, :aa=>8, :other=>3}
{:white=>69, :latino=>19, :aa=>4, :other=>8}
{:white=>68, :latino=>17, :aa=>9, :other=>6}
{:white=>68, :latino=>16, :aa=>12, :other=>4}
{:white=>51, :latino=>27, :aa=>10, :other=>12}
{:white=>69, :latino=>23, :aa=>6, :other=>2}
{:white=>63, :latino=>19, :aa=>14, :other=>4}
(Note the order of the keys above varied; I reordered them to improve readability.)
One could alternatively write
def draw_sample(n)
n.times.map { select_one }.tally
end
though this has the disadvantage that it creates an intermediate array.
See Kernel#rand, the form of Hash::new that takes an argument (the default value, here zero) and Enumerable#tally.
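Since the question states the shares as percentages (65, 20, 10, 5), the cumulative hash could be derived from them instead of being typed by hand. A rough sketch, where cdf_from_shares and select_from are hypothetical helper names:

```ruby
# Turn a hash of shares into a cumulative distribution with the same keys.
def cdf_from_shares(shares)
  total = shares.values.sum.to_f
  running = 0
  shares.transform_values { |share| (running += share) / total }
end

# Same lookup as select_one above, with the CDF and the random number passed in.
def select_from(cdf, rn = rand)
  cdf.find { |_key, cum| rn <= cum }.first
end

cdf = cdf_from_shares(white: 65, latino: 20, aa: 10, other: 5)
p cdf
p select_from(cdf)
```

Because the shares are divided by their own total, they do not need to add up to exactly 100.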
From what I understand, each demographic depends on some external variables. What you could do then is
whites = getWhites(); // could be anything really
aa = getAA();
latinos = getLatinos();
sum = whites + aa + latinos;
whites = whites / sum * 100;
aa = aa / sum * 100;
latinos = latinos / sum * 100;
This guarantees that they always sum up to 100
Edit: The code is pseudocode (not ruby), assuming floating point data type
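In Ruby, the normalization described by the pseudocode might look like the following sketch. The method name normalize_to_percent and the raw scores are made up for illustration; in practice each raw value would come from whatever independent generator feeds that demographic.

```ruby
# Scale independently generated, non-negative raw scores so they sum to 100.
def normalize_to_percent(raw)
  total = raw.values.sum.to_f
  raise ArgumentError, "raw scores must not all be zero" if total.zero?
  raw.transform_values { |value| value / total * 100 }
end

raw_scores = { white: 130, latino: 40, aa: 20, other: 10 } # hypothetical generator output
p normalize_to_percent(raw_scores)
```

No group is forced to be the plurality here: whichever raw score happens to be largest ends up with the largest share.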

Using regular expressions to multiply and sum numeric string characters contained in a hash of mixed numeric strings

Without getting too much into biology, Proteins are made of Amino Acids. Each of the 20 Amino Acids that make up Proteins are represented by characters in a sequence. Each Amino Acid char has a different chemical formula, which I represent as strings. For example, "M" has a formula of "C5H11NO2S"
Given the 20 different formulas (and the varying frequency of each amino acid chars in a protein sequence) I want to compile all 20 of them into a single formula that will yield the total formula for the protein.
So first: multiply each formula by the frequency of its char in the sequence
Second : sum together all multiplied formulas into one formula.
To accomplish this, I first tried multiplying each amino acid char frequency in the sequence by the numbers in the chemical formula. I did this using .tally
sequence ="MGAAARTLRLALGLLLLATLLRPADACSCSPVHPQQAFCNADVVIRAKAVSEKEVDSGNDIYGNPIKRIQYEIKQIKMFKGPEKDIEFI"
sequence.chars.tally #=> {"M"=>2, "G"=>5, "A"=>11, "R"=>5, "T"=>2, "L"=>9, "P"=>5, "D"=>5, "C"=>3, "S"=>4, "V"=>5, "H"=>1, "Q"=>4, "F"=>3, "N"=>3, "I"=>8, "K"=>7, "E"=>5, "Y"=>2}
Then, I listed all the amino acids chars and formulas into a hash
hash_of_formulas = {"A"=>"C3H7NO2", "R"=>"C6H14N4O2", "N"=>"C4H8N2O3", "D"=>"C4H7NO4", "C"=>"C3H7NO2S", "E"=>"C5H9NO4", "Q"=>"C5H10N2O3", "G"=>"C2H5NO2", "H"=>"C6H9N3O2", "I"=>"C6H13NO2", "L"=>"C6H13NO2", "K"=>"C6H14N2O2", "M"=>"C5H11NO2S", "F"=>"C9H11NO2", "P"=>"C5H9NO2", "S"=>"C3H7NO3", "T"=>"C4H9NO3", "W"=>"C11H12N2O2", "Y"=>"C9H11NO3", "V"=>"C5H11NO2"}
An example of what the process for my overall goal is:
In the sequence, "M" occurs twice, so "C5H11NO2S" becomes "C10H22N2O4S2". "C", which has the formula "C3H7NO2S", occurs 3 times, so "C3H7NO2S" becomes "C9H21N3O6S3".
So, Summing together "C10H22N2O4S2" and "C9H21N3O6S3" will yield "C19H43N5O10S5"
How can I repeat the process of multiplying each formula by its frequency and then summing together all multiplied formulas?
I know that I could use regex for multiplying a formula by its frequency for an individual string using
formula_multiplied_by_frequency = "C5H11NO2S".gsub(/\d+/) { |x| x.to_i * 4}
But I'm not sure of any methods to use regex on strings embedded within hashes
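One thing to watch with the gsub above: /\d+/ only touches counts that are written out, so an element with an implicit count of 1 (the N and S in "C5H11NO2S") is left unmultiplied. A sketch that matches each element symbol together with its optional count, using a hypothetical helper named multiply_formula:

```ruby
# Multiply every element count in a formula, treating a missing count as 1.
def multiply_formula(formula, factor)
  formula.gsub(/([A-Z][a-z]?)(\d*)/) do
    element, count = $1, $2
    "#{element}#{(count.empty? ? 1 : count.to_i) * factor}"
  end
end

puts multiply_formula("C5H11NO2S", 2)
puts multiply_formula("C3H7NO2S", 3)
```

The two calls reproduce the multiplied formulas from the worked example in the question. Note that this sketch always writes the count, so a factor of 1 would yield "N1" rather than "N".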
If I understand correctly, you want to produce the total formula for a given protein sequence. Here's how I'd do it:
AMINO_ACIDS = {"A"=>"C3H7NO2", "R"=>"C6H14N4O2", "N"=>"C4H8N2O3", "D"=>"C4H7NO4", "C"=>"C3H7NO2S", "E"=>"C5H9NO4", "Q"=>"C5H10N2O3", "G"=>"C2H5NO2", "H"=>"C6H9N3O2", "I"=>"C6H13NO2", "L"=>"C6H13NO2", "K"=>"C6H14N2O2", "M"=>"C5H11NO2S", "F"=>"C9H11NO2", "P"=>"C5H9NO2", "S"=>"C3H7NO3", "T"=>"C4H9NO3", "W"=>"C11H12N2O2", "Y"=>"C9H11NO3", "V"=>"C5H11NO2"}
AMINO_ACID_COMPOSITIONS = AMINO_ACIDS.each_with_object({}) { |(amino_acid, formula), compositions|
  compositions[amino_acid] = formula.scan(/([A-Z][a-z]*)(\d*)/).map { |element, count| [element, count.empty? ? 1 : count.to_i] }.to_h
}
def formula(sequence)
  sequence.each_char.with_object(Hash.new(0)) { |amino_acid, final_counts|
    AMINO_ACID_COMPOSITIONS[amino_acid].each { |element, element_count|
      final_counts[element] += element_count
    }
  }.map { |element, element_count|
    # By chemical convention a count of 1 is left implicit.
    "#{element}#{element_count == 1 ? "" : element_count}"
  }.join
end
sequence = "MGAAARTLRLALGLLLLATLLRPADACSCSPVHPQQAFCNADVVIRAKAVSEKEVDSGNDIYGNPIKRIQYEIKQIKMFKGPEKDIEFI"
p formula(sequence)
# => "C434H888N120O213S5"
You can't use regexp to multiply things. You can use it to parse a formula, but then it's on you and regular Ruby to do the math. The first job is to prepare a composition lookup by breaking down each amino acid formula. Once we have a composition hash for each amino acid, we can iterate over the sequence and add up all the elements of each amino acid.
BTW, tally is not particularly useful here, since tally will need to iterate over the sequence, and then you have to iterate over the tally anyway, and there is no aggregate operation going on that can't be done going over each letter independently.
EDIT: I probably made the regexp slightly more complicated than it needs to be, but it should parse stuff like CuSO4 correctly. I don't know if it's an accident or not that all amino acids are only composed of elements with a single-character symbol... :P
Givens
We are given a string representing a protein comprised of amino acids:
sequence = "MGAAARTLRLALGLLLLATLLRPADACSCSPVHPQQAFCNADVVIR" +
"AKAVSEKEVDSGNDIYGNPIKRIQYEIKQIKMFKGPEKDIEFI"
and a hash that contains the formulas of amino acids:
formulas = {
"A"=>"C3H7NO2", "R"=>"C6H14N4O2", "N"=>"C4H8N2O3", "D"=>"C4H7NO4",
"C"=>"C3H7NO2S", "E"=>"C5H9NO4", "Q"=>"C5H10N2O3", "G"=>"C2H5NO2",
"H"=>"C6H9N3O2", "I"=>"C6H13NO2", "L"=>"C6H13NO2", "K"=>"C6H14N2O2",
"M"=>"C5H11NO2S", "F"=>"C9H11NO2", "P"=>"C5H9NO2", "S"=>"C3H7NO3",
"T"=>"C4H9NO3", "W"=>"C11H12N2O2", "Y"=>"C9H11NO3", "V"=>"C5H11NO2"
}
Obtain counts of atoms in each amino acid
As a first step we can calculate the numbers of each atom in each amino acid:
counts = formulas.transform_values do |s|
  s.scan(/[CHNOS]\d*/).
    each_with_object({}) do |s,h|
      h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
    end
end
#=> {"A"=>{"C"=>3, "H"=>7, "N"=>1, "O"=>2},
# "R"=>{"C"=>6, "H"=>14, "N"=>4, "O"=>2},
# ...
# "M"=>{"C"=>5, "H"=>11, "N"=>1, "O"=>2, "S"=>1}
# ...
# "V"=>{"C"=>5, "H"=>11, "N"=>1, "O"=>2}}
Compute formula for protein
Then it's simply:
def protein_formula(sequence, counts)
  sequence.each_char.
    with_object("C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0) do |c,h|
      counts[c].each { |aa,cnt| h[aa] += cnt }
    end.each_with_object('') { |(aa,nbr),s| s << "#{aa}#{nbr}" }
end
protein_formula(sequence, counts)
#=> "C434H888N120O213S5"
Another example:
protein_formula("MCMPCFTTDHQMARKCDDCCGGKGRGKCYGPQCLCR", counts)
#=> "C158H326N52O83S11"
Explanation of calculation of counts
This calculation:
counts = formulas.transform_values do |s|
s.scan(/[CHNOS]\d*/).each_with_object({}) do |s,h|
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
end
end
uses the method Hash#transform_values. It will return a hash having the same keys as the hash formulas, with the values of those keys in formulas modified by transform_values's block. For example, formulas["A"] ("C3H7NO2") is "transformed" to the hash {"C"=>3, "H"=>7, "N"=>1, "O"=>2} in the hash that is returned, counts.
transform_values passes each value of formulas to the block and sets the block variable equal to it. The first value passed is "C3H7NO2", so it sets:
s = "C3H7NO2"
We can write the block calculation more simply:
h = {}
s.scan(/[CHNOS]\d*/).each do |s|
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
end
h
(Once you understand this calculation, which I explain below, see Enumerable#each_with_object to understand why I used that method in my solution.)
After initializing h to an empty hash, the following calculations are performed:
h = {}
a = s.scan(/[CHNOS]\d*/)
#=> ["C3", "H7", "N", "O2"]
a is computed using String#scan with the regular expression /[CHNOS]\d*/. That regular expression, or regex, matches exactly one character in the character class [CHNOS] followed by zero or more (*) digits (\d). It therefore separates the string "C3H7NO2" into the substrings that are returned in the array shown under the calculation of a above. Continuing,
a.each do |s|
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
end
changes h to the following:
h #=> {"C"=>3, "H"=>7, "N"=>1, "O"=>2}
The block variable s is initially set equal to the first element of a that is passed to each's block:
s = "C3"
then we compute:
h[s[0]] = s.size == 1 ? 1 : s[1..-1].to_i
h["C"] = 2 == 1 ? 1 : "3".to_i
       = false ? 1 : 3
       = 3
This is repeated for each element of a.
Explanation of construction of formula for the protein
We can simplify the following code [1]:
sequence.each_char.with_object("C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0) do |c,h|
  counts[c].each { |aa,cnt| h[aa] += cnt }
end.each_with_object('') { |(aa,nbr),s| s << "#{aa}#{nbr}" }
to more or less the following:
h = { "C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0 }
ch = sequence.chars
#=> ["M", "G", "A",..., "F", "I"]
ch.each do |c|
counts[c].each { |aa,cnt| h[aa] += cnt }
end
h #=> {"C"=>434, "H"=>888, "N"=>120, "O"=>213, "S"=>5}
When the first value of ch ("M") is passed to each's block (when h = { "C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0 }), the following calculations are performed:
c = "M"
g = counts[c]
#=> {"C"=>5, "H"=>11, "N"=>1, "O"=>2, "S"=>1}
g.each { |aa,cnt| h[aa] += cnt }
h #=> {"C"=>5, "H"=>11, "N"=>1, "O"=>2, "S"=>1}
Lastly, (when h #=> {"C"=>434, "H"=>888, "N"=>120, "O"=>213, "S"=>5})
s = ''
h.each { |aa,nbr| s << "#{aa}#{nbr}" }
s #=> "C434H888N120O213S5"
When aa = "C" and nbr = 434,
"#{aa}#{nbr}"
#=> "C434"
is appended to the string s.
1. ("C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0) is shorthand for ({"C"=>0, "H"=>0, "N"=>0, "O"=>0, "S"=>0}).
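For comparison, here is a sketch of the tally-based route the question started from: count each amino acid once, then scale its atom counts by that frequency. The COUNTS hash below is a three-entry excerpt with the same shape as counts above, and protein_formula_by_tally is a hypothetical name; neither comes from the answer itself.

```ruby
# Excerpt of per-amino-acid atom counts, in the same shape as `counts` above.
COUNTS = {
  "M" => { "C" => 5, "H" => 11, "N" => 1, "O" => 2, "S" => 1 },
  "C" => { "C" => 3, "H" => 7,  "N" => 1, "O" => 2, "S" => 1 },
  "G" => { "C" => 2, "H" => 5,  "N" => 1, "O" => 2 }
}.freeze

# Tally the sequence first, then add frequency * atom count for each amino acid.
def protein_formula_by_tally(sequence, counts)
  totals = Hash.new(0)
  sequence.each_char.tally.each do |amino_acid, frequency|
    counts.fetch(amino_acid).each { |atom, n| totals[atom] += n * frequency }
  end
  totals.map { |atom, n| "#{atom}#{n}" }.join
end

puts protein_formula_by_tally("MMCCC", COUNTS)
```

The sequence "MMCCC" is the question's worked example (two M plus three C), so the printed formula can be compared with the sum given there.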

How do I select a random key from a hash using the probability distribution stored within the corresponding values?

When I create a new opal, I want to randomly assign it one of many possible features. However, I want some qualities to be more common than others. I have a hash with possible features and their relative probability (out of a total of 1).
How do I choose a feature at random, but weighted according to the probability?
'possible_features':
{
'white_pin_fire_green': '0.00138',
'white_pin_fire_blue': '0.00138',
'white_pin_fire_yellow': '0.00144',
'white_pin_fire_purple': '0.00144',
'white_pin_fire_pink': '0.00036',
'white_straw_green': '0.01196',
'white_straw_blue': '0.01196',
'white_straw_yellow': '0.01248',
'white_straw_purple': '0.01248',
'white_straw_pink': '0.00312',
'white_ribbon_green': '0.01196',
'white_ribbon_blue': '0.01196',
'white_ribbon_yellow': '0.01248',
'white_ribbon_purple': '0.01248',
'white_ribbon_pink': '0.00312',
'white_harlequin_green': '0.0069',
'white_harlequin_blue': '0.0069',
'white_harlequin_yellow': '0.0072',
'white_harlequin_purple': '0.0072',
'white_harlequin_pink': '0.0018',
'white_no_fire': '0.06',
'black_pin_fire_green': '0.00552',
'black_pin_fire_blue': '0.00552',
'black_pin_fire_yellow': '0.00576',
'black_pin_fire_purple': '0.00576',
'black_pin_fire_pink': '0.00144',
'black_straw_green': '0.04784',
'black_straw_blue': '0.04784',
'black_straw_yellow': '0.04992',
'black_straw_purple': '0.04992',
'black_straw_pink': '0.01248',
'black_ribbon_green': '0.04784',
'black_ribbon_blue': '0.04784',
'black_ribbon_yellow': '0.04992',
'black_ribbon_purple': '0.04992',
'black_ribbon_pink': '0.01248',
'black_harlequin_green': '0.0276',
'black_harlequin_blue': '0.0276',
'black_harlequin_yellow': '0.0288',
'black_harlequin_purple': '0.0288',
'black_harlequin_pink': '0.0072',
'black_no_fire': '0.24'
}
For example, if I randomly generate 100 opals, I'd like for approximately 24 of them to have the "black_no_fire" feature.
Thank you for any help!
If I can assume that the hash values do indeed add up to exactly 1.0, then the solution is a little simpler. (Otherwise, this approach would still work, but it requires a little extra effort to first sum all the values and use them as weightings rather than direct probabilities.)
First, let's choose a random value between 0 and 1, to represent a "fair selection". You may wish to use SecureRandom.random_number in your implementation.
Then, I loop through the possibilities, seeing when the cumulative sum reaches the chosen value.
possible_features = {
white_pin_fire_green: "0.00138",
white_pin_fire_blue: "0.00138",
# ...
}
r = rand
possible_features.find { |choice, probability| (r -= probability.to_f) <= 0 }.first
This effectively treats each possibility as covering a range: 0 .. 0.00138, 0.00138 .. 0.00276, 0.00276 .. 0.00420, ..., 0.76 .. 1.
Since the original random value (r) was chosen from a uniform distribution, its value will lie within one of those ranges with the desired weighted probability.
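A sketch of the "extra effort" variant mentioned above, for values that are weights rather than exact probabilities. The method name weighted_key and the optional r argument are illustrative assumptions; the fallback to the last key covers the case where floating-point rounding leaves the running total slightly above zero after the final entry.

```ruby
# Pick a key with probability proportional to its (string or numeric) weight.
def weighted_key(weights, r = rand)
  total = weights.values.sum { |w| w.to_f }
  target = r * total # scale the random number up instead of normalizing every weight
  pair = weights.find { |_key, w| (target -= w.to_f) <= 0 }
  pair ? pair.first : weights.keys.last
end

features = { white_no_fire: "0.06", black_no_fire: "0.24", black_straw_green: "0.04784" }
p weighted_key(features)
```

With this subset, :black_no_fire should come up roughly 0.24 / 0.34784 of the time, since only the ratios between the weights matter.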
Suppose your hash were as follows.
pdf = {
white_pin_fire_green: 0.21,
white_pin_fire_blue: 0.25,
white_pin_fire_yellow: 0.23,
white_pin_fire_purple: 0.16,
white_pin_fire_pink: 0.15
}
pdf.values.sum
#=> 1.0
I've made the values floats rather than strings merely to avoid the need for a boring conversion. Note that the keys, which are symbols, do not require single quotes here.
We can assume that all of the values in pdf are positive, as any that are zero can be removed.
Let's first create a method that converts pdf (probability density function) to cdf (cumulative distribution function).
def pdf_to_cdf(pdf)
  cum = 0.0
  pdf.each_with_object({}) do |(k,v),h|
    cum += v
    h[cum] = k
  end
end
cdf = pdf_to_cdf(pdf)
#=> {0.21=>:white_pin_fire_green,
# 0.45999999999999996=>:white_pin_fire_blue,
# 0.69=>:white_pin_fire_yellow,
# 0.85=>:white_pin_fire_purple,
# 1.0=>:white_pin_fire_pink}
Yes, I've inverted the cdf by flipping the keys and values. That's not a problem, since all pdf values are positive, and it's more convenient this way, for reasons to be explained.
For convenience let's now create an array of cdf's keys.
cdf_keys = cdf.keys
#=> [0.21, 0.46, 0.69, 0.85, 1.0]
We sample a single probability-weighted value by generating a (pseudo-) random number p between 0.0 and 1.0 (e.g., p = rand #=> 0.793292984248818) and then determine the smallest index i for which
cdf_keys[i] >= p
Suppose p = 0.65. Then
cum_prob = cdf_keys.find { |cum_prob| cum_prob >= 0.65 }
#=> 0.69
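Note that, because cdf_keys is an increasing sequence, the operation
cum_prob = cdf_keys.find { |cum_prob| cum_prob >= p }
could be sped up by using Array#bsearch. (The random number p must be drawn once, before the search; writing cum_prob >= rand inside the block would draw a new number for every comparison.)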
So we select
selection = cdf[cum_prob]
#=> :white_pin_fire_yellow
Note that the probability that rand will be between 0.46 and 0.69 equals 0.69 - 0.46 = 0.23, which, by construction, is the desired probability of selecting :white_pin_fire_yellow.
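The Array#bsearch speed-up mentioned above might look like this sketch. The method name first_cum_prob_at_least is hypothetical, and the array holds the rounded cumulative values shown earlier.

```ruby
# Binary search for the smallest cumulative probability that is >= p_value.
# Requires cdf_keys to be sorted in increasing order.
def first_cum_prob_at_least(cdf_keys, p_value)
  cdf_keys.bsearch { |cum_prob| cum_prob >= p_value }
end

cdf_keys = [0.21, 0.46, 0.69, 0.85, 1.0]
p first_cum_prob_at_least(cdf_keys, 0.65) #=> 0.69
```

The returned value can then be used as the key into cdf, exactly as with the result of find.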
If we wish to sample additional values "with replacement", we simply generate additional random numbers between zero and one and repeat the above calculations.
If we wish to sample additional values "without replacement" (no repeated selections), we must first remove the element just drawn from the pdf. First, however, let's note the probability of selection:
selection_prob = pdf[selection]
#=> 0.23
Now delete selection from pdf.
pdf.delete(:white_pin_fire_yellow)
pdf
#=> {:white_pin_fire_green=>0.21,
# :white_pin_fire_blue=>0.25,
# :white_pin_fire_purple=>0.16,
# :white_pin_fire_pink=>0.15}
As pdf.values.sum #=> 0.77 we must normalize the values so they sum to 1.0. To do that we don't actually have to sum the values as that sum equals
adj = 1.0 - selection_prob
#=> 1.0 - 0.23 => 0.77
Now normalize the new pdf:
pdf.each_key { |k| pdf[k] = pdf[k]/adj }
#=> {:white_pin_fire_green=>0.2727272727272727,
# :white_pin_fire_blue=>0.3246753246753247,
# :white_pin_fire_purple=>0.20779220779220778,
# :white_pin_fire_pink=>0.1948051948051948}
pdf.values.sum
#=> 1.0
We now repeat the steps described above when selecting the first element at random (construct cdf, generate a random number between zero and one, and so on).
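The whole "without replacement" loop could be sketched as follows. The method name sample_without_replacement and the rng: parameter are assumptions for illustration. Instead of dividing every remaining value by adj after each draw, this version scales the random number by the remaining total, which has the same effect on the selection probabilities.

```ruby
# Draw up to n distinct keys, each draw weighted by the remaining probabilities.
def sample_without_replacement(pdf, n, rng: Random.new)
  remaining = pdf.dup # leave the caller's hash untouched
  Array.new([n, remaining.size].min) do
    target = rng.rand * remaining.values.sum
    pair = remaining.find { |_key, prob| (target -= prob) <= 0 }
    key = pair ? pair.first : remaining.keys.last # fallback for rounding error
    remaining.delete(key)
    key
  end
end

pdf = {
  white_pin_fire_green: 0.21, white_pin_fire_blue: 0.25, white_pin_fire_yellow: 0.23,
  white_pin_fire_purple: 0.16, white_pin_fire_pink: 0.15
}
p sample_without_replacement(pdf, 3)
```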

Find element(s) closest to average of array

What would be a 'ruby' way to do the following; I'm still thinking in more imperative style programming and not really adapting to thinking in ruby. What I want to do is find the closest element in size to the average of an array, for example, consider the following array
[1,2,3]
The average is 2.0. The method I want to write returns the element closest to the average from above and below it, in this case 1 and 3.
Another example will illustrate this better:
[10,20,50,33,22] avg is 27.0 method would return 22 and 33.
This is not the most efficient, but it is (in my humble opinion) rather Ruby-esque.
class Array
  # Return the single element in the array closest to the average value
  def closest_to_average
    avg = inject(0.0,:+) / length
    min_by{ |v| (v-avg).abs }
  end
end
[1,2,3].closest_to_average
#=> 2
[10,20,50,33,22].closest_to_average
#=> 22
If you really want the n closest items, then:
class Array
  # Return a number of elements in the array closest to the average value
  def closest_to_average(results=1)
    avg = inject(0.0,:+) / length
    sort_by{ |v| (v-avg).abs }[0,results]
  end
end
[10,20,50,33,22].closest_to_average #=> [22]
[10,20,50,33,22].closest_to_average(2) #=> [22, 33]
[10,20,50,33,22].closest_to_average(3) #=> [22, 33, 20]
How this Works
avg = inject(0.0,:+) / length
is shorthand for:
avg = self.inject(0.0){ |sum,n| sum+n } / self.length
I start off with a value of 0.0 instead of 0 to ensure that the sum will be a floating point number, so that dividing by the length does not give me an integer-rounded value.
sort_by{ |v| (v-avg).abs }
sorts the array based on the difference between the number and average (lowest to highest), and then:
[0,results]
selects the first results number of entries from that array.
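If reopening Array is not desirable, roughly the same result can be had with a plain method. This sketch uses Enumerable#min_by with a count argument in place of sort_by plus slicing; the method name closest_to_mean is made up for illustration.

```ruby
# Return the n elements closest to the mean, nearest first, without patching Array.
def closest_to_mean(values, n = 1)
  avg = values.sum.fdiv(values.size)
  values.min_by(n) { |v| (v - avg).abs }
end

p closest_to_mean([10, 20, 50, 33, 22])    #=> [22]
p closest_to_mean([10, 20, 50, 33, 22], 3) #=> [22, 33, 20]
```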
I assume that what is desired is the largest element of the array that is smaller than the average and the smallest value of the array that is larger than the average. Such values exist if and only if the array has at least two elements and they are not all the same. Assuming that condition applies, we need only convert it from words to symbols:
avg = a.reduce(:+)/a.size.to_f
[ a.select { |e| e < avg }.max, a.select { |e| e > avg }.min ]
Another way, somewhat less efficient:
avg = a.reduce(:+)/a.size.to_f
b = (a + [avg]).uniq.sort
i = b.index(avg)
[ b[i-1], b[i+1] ]
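Wrapped as a method, the first approach might be sketched like this. The name neighbors_of_mean is hypothetical. When no element lies strictly below or strictly above the average (for example, when all elements are equal), the corresponding slot is nil.

```ruby
# Largest element strictly below the mean and smallest element strictly above it.
def neighbors_of_mean(values)
  avg = values.sum.fdiv(values.size)
  below, above = values.reject { |v| v == avg }.partition { |v| v < avg }
  [below.max, above.min]
end

p neighbors_of_mean([1, 2, 3])              #=> [1, 3]
p neighbors_of_mean([10, 20, 50, 33, 22])   #=> [22, 33]
```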

Grouping numbers for a histogram

I have a bunch of numbers I want to use to generate a histogram for a standard score.
Therefore I compute the mean and the standard deviation of the numbers and normalize each x with this formula
x' = (x-mean)/std_dev
The result is a number between -4 and 4. I want to chart that result. I am looking for a way to group the numbers in order to avoid bars that are too small.
My plan is to have bins in the interval [-4,4] centered at consecutive quarter units, i.e. [-4,-3.75,...,3.75,4]
Example: 0.1 => bin "0.0", 0.3 => bin "0.25", -1.3 => bin "-1.25"
What is the best way to achieve that?
Here's a solution that doesn't use any third-party libraries. The numbers should be in the Array vals.
MULTIPLIER = 0.25
multipliers = []
0.step(1, MULTIPLIER) { |n| multipliers << n }
histogram = Hash.new 0
# find the appropriate "bin" and create the histogram
vals.each do |val|
  # create an array with all the residuals and select the smallest
  cmp = multipliers.map { |group| [group, (group - val % 1).abs] }
  bin = cmp.min { |a, b| a.last <=> b.last }.first
  # floor (not truncate): Ruby's val % 1 is non-negative even for negative val
  histogram[val.floor + bin] += 1
end
I think that it performs the proper rounding. But I only tried it with:
vals = Array.new(10000) { (rand * 10) % 4 * (rand(2) == 0 ? 1 : -1) }
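and the distribution came out somewhat skewed, but that is a property of the test data rather than of the random number generator: (rand * 10) % 4 folds [0, 10) onto [0, 4), so values below 2 occur about 1.5 times as often as values from 2 to 4.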
Ruby provides Enumerable#group_by (it has been in core since Ruby 1.8.7). If you are on an older Ruby without Rails, see the Rails source here: http://api.rubyonrails.org/classes/Enumerable.html
Assuming your list is called xs, you could do something like the following (untested):
bars = xs.group_by { |x| bin_for(x) } # bin_for is a placeholder: determine the bin here
Then you'll have a hash that looks like:
bars = { 0 => [elements,in,first,bin], 1 => [elements,in,second,bin], etc }
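For the quarter-unit bins described in the question, the bin could be computed by rounding to the nearest multiple of the bin width. This is a sketch under that assumption; quarter_bin and histogram are hypothetical names, and a value exactly halfway between two bin centers is rounded away from zero by Float#round.

```ruby
# Center of the nearest bin, for bins centered on multiples of `width`.
def quarter_bin(x, width = 0.25)
  (x / width).round * width
end

# Count how many values fall into each bin.
def histogram(values, width = 0.25)
  values.group_by { |x| quarter_bin(x, width) }.transform_values(&:size)
end

p histogram([0.1, 0.3, 0.2, -1.3])
#=> {0.0=>1, 0.25=>2, -1.25=>1}
```

Sorting the resulting hash by key gives the bars in chart order.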
