Grouping numbers for a histogram - ruby

I have a bunch of numbers I want to use to generate a histogram for a standard score.
Therefore I compute the mean and the standard deviation of the numbers and normalize each x with this formula
x' = (x-mean)/std_dev
The result is a number between -4 and 4. I want to chart that result. I am looking for a way to group the numbers in order to avoid bars that are too small.
My plan is to have bins in the interval [-4,4] centered at consecutive quarter units, i.e. [-4,-3.75,...,3.75,4]
Example: 0.1 => bin "0.0", 0.3 => bin "0.25", -1.3 => bin "-1.25"
What is the best way to achieve that?

Here's a solution that doesn't use any third-party libraries. The numbers should be in the Array vals.
MULTIPLIER = 0.25
multipliers = []
0.step(1, MULTIPLIER) { |n| multipliers << n }
histogram = Hash.new 0
# find the appropriate "bin" and create the histogram
vals.each do |val|
# create an array with all the residuals and select the smallest
cmp = multipliers.map { |group| [group, (group - val%1).abs] }
bin = cmp.min { |a, b| a.last <=> b.last }.first
# use floor, not truncate: val % 1 is never negative in Ruby, even when val is
histogram[val.floor + bin] += 1
end
I think that it performs the proper rounding. But I only tried it with:
vals = Array.new(10000) { (rand * 10) % 4 * (rand(2) == 0 ? 1 : -1) }
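and the distribution came out somewhat skewed. Part of that is the test data rather than the random number generator: (rand * 10) % 4 folds [0, 10) onto [0, 4) unevenly, so magnitudes below 2 come up more often.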
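For comparison, the same binning can be done with plain arithmetic: divide by the bin width, round to the nearest integer, and multiply back. This is only a sketch; nearest_bin and histogram are names made up for illustration, and the 0.25 default is the quarter-unit width from the question. Note that -1.3 lands in -1.25, the nearest quarter.

```ruby
# Round a standard score to the centre of its nearest bin.
# `width` defaults to the quarter-unit spacing from the question.
def nearest_bin(z, width = 0.25)
  (z / width).round * width
end

# Count how many scores fall into each bin.
def histogram(scores, width = 0.25)
  scores.each_with_object(Hash.new(0)) do |z, counts|
    counts[nearest_bin(z, width)] += 1
  end
end

p nearest_bin(0.1)   # 0.0
p nearest_bin(0.3)   # 0.25
p nearest_bin(-1.3)  # -1.25
```

Because multiples of 0.25 are exactly representable as floats, the bin centres are safe to use as hash keys; with a width such as 0.1 you may prefer to key on the rounded integer instead.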

Rails provides Enumerable#group_by (it has also been part of core Ruby since 1.8.7). If you're not using Rails, see the source here: http://api.rubyonrails.org/classes/Enumerable.html
Assuming your list is called xs, you could do something like the following (untested):
bars = xs.group_by do |x|
# determine bin here
end
Then you'll have a hash that looks like:
bars = { 0 => [elements,in,first,bin], 1 => [elements,in,second,bin], etc }
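One way the group_by idea could be filled in for the standard-score histogram is sketched below. The helper names standard_scores and bin_counts are invented here, the population standard deviation (dividing by n) is an assumption, and the input must contain at least two distinct values so the standard deviation is not zero.

```ruby
# Standardise the values: x' = (x - mean) / std_dev.
# Assumes the population standard deviation (divide by n).
def standard_scores(xs)
  mean = xs.sum.to_f / xs.size
  variance = xs.sum { |x| (x - mean)**2 } / xs.size
  std_dev = Math.sqrt(variance)
  xs.map { |x| (x - mean) / std_dev }
end

# Group scores by nearest bin centre, then keep only the group sizes.
def bin_counts(scores, width = 0.25)
  scores.group_by { |z| (z / width).round * width }
        .transform_values(&:size)
        .sort.to_h
end

xs = [2, 4, 4, 4, 5, 5, 7, 9]          # mean 5, std_dev 2
p bin_counts(standard_scores(xs))
```

For this input the bins are -1.5, -0.5, 0.0, 1.0 and 2.0 with counts 1, 3, 2, 1 and 1.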

Related

Generating a series of non-random numbers with a maximum sum of 100

I am working on a program that has a component that will generate simulated demographic numbers for various hypothetical jurisdictions.
The methods I have set up to generate each subset of demographics are dependent on some other variables, but generally, most jurisdictions are supposed to look something like:
White - 65%
Latino - 20%
African-American - 10%
Other - 5%
Of course, this isn't always the case. In some scenarios, white may be well under 50% with either Latino or AA being the most significant, but those are more edge cases. But in general that's usually about the balance.
So I am trying to figure out how to generate each demographic, which again is fed from different variables, mostly independently, but ensuring the number always adds up to 100.
I had thought about generating white first, since it's typically the largest, and then just creating a generator where Latino% = 100 - white%*.35 (.35 is just an example here), and so forth, but this creates a situation in which white would always be the plurality, which I don't necessarily want to happen.
So I am a bit stuck. I imagine this is as much a math problem as a Ruby problem. As a non-math person (who, as they have delved into programming, wishes they had paid better attention in class), I'd appreciate any guidance here.
Thank you!
First specify a cumulative distribution function (CDF).
DIST = { white: 0.65, latino: 0.85, aa: 0.95, other: 1.00 }
Note that
DIST[:white] - 0 #=> 0.65
DIST[:latino] - DIST[:white] #=> 0.20
DIST[:aa] - DIST[:latino] #=> 0.10
DIST[:other] - DIST[:aa] #=> 0.05
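If the shares are produced elsewhere in the program, the cumulative thresholds do not have to be typed in by hand. The following is a rough sketch; cumulative is a hypothetical helper, and the shares hash simply reuses the figures from the question.

```ruby
# Turn per-group shares into cumulative thresholds in (0, 1].
# The running total is divided by the grand total at each step, so the last
# threshold is exactly 1.0.
def cumulative(shares)
  total = shares.values.sum.to_f
  running = 0
  shares.transform_values { |share| (running += share) / total }
end

shares = { white: 65, latino: 20, aa: 10, other: 5 }
dist = cumulative(shares)
p dist
```

The result has the same shape as DIST, so it can be searched with the same find call.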
Now create a method to (pseudo-) randomly select one person from the population and return their ethnicity.
def select_one
rn = rand
DIST.find { |_k, v| rn <= v }.first
end
Try it.
10.times { p select_one }
:white
:aa
:latino
:white
:white
:white
:white
:white
:white
:latino
Now write a method to return a random sample of size n.
def draw_sample(n)
n.times.with_object(Hash.new(0)) { |_, h| h[select_one] += 1 }
end
Try it.
10.times { p draw_sample(100) }
{:white=>66, :latino=>21, :aa=>9, :other=>4}
{:white=>72, :latino=>14, :aa=>11, :other=>3}
{:white=>61, :latino=>19, :aa=>14, :other=>6}
{:white=>64, :latino=>25, :aa=>8, :other=>3}
{:white=>69, :latino=>19, :aa=>4, :other=>8}
{:white=>68, :latino=>17, :aa=>9, :other=>6}
{:white=>68, :latino=>16, :aa=>12, :other=>4}
{:white=>51, :latino=>27, :aa=>10, :other=>12}
{:white=>69, :latino=>23, :aa=>6, :other=>2}
{:white=>63, :latino=>19, :aa=>14, :other=>4}
(Note the order of the keys above varied; I reordered them to improve readability.)
One could alternatively write
def draw_sample(n)
n.times.map { select_one }.tally
end
though this has the disadvantage that it creates an intermediate array.
See Kernel#rand, the form of Hash::new that takes an argument (the default value, here zero) and Enumerable#tally.
From what I understand, each demographic depends on some external variables. What you could do then is
whites = getWhites(); // could be anything really
aa = getAA();
latinos = getLatinos();
sum = whites + aa + latinos;
whites = whites / sum * 100;
aa = aa / sum * 100;
latinos = latinos / sum * 100;
This guarantees that they always sum to 100.
Edit: the code is pseudocode (not Ruby) and assumes a floating-point data type.
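In Ruby the same normalisation might look like the sketch below. normalize_to_percent is an invented name, and the raw scores are placeholder numbers standing in for whatever the separate generators return.

```ruby
# Scale independently generated raw scores so that the shares sum to 100.
def normalize_to_percent(raw)
  total = raw.values.sum.to_f
  raw.transform_values { |value| value * 100.0 / total }
end

raw = { white: 130, latino: 40, aa: 20, other: 10 }  # hypothetical raw scores
shares = normalize_to_percent(raw)
p shares
p shares.values.sum
```

Since each group is generated on its own before scaling, no group is forced to be the plurality. The results are floats, so if whole percentages are needed the rounding has to be handled separately to keep the total at 100.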

How to return matches and non matches when comparing two ranges/arrays

I'm attempting to return a count of the total number of matching and non-matching entries in a set of two ranges.
I'm trying to avoid looping over the array twice like this:
#expected output:
#inside: 421 | outside: 55
constant_range = 240..960
sample_range = 540..1015
sample_range_a = sample_range.to_a
def generate_range
inside = sample_range_a.select { |val| constant_range.include?(val) }.count
outside = sample_range_a.select { |val| !constant_range.include?(val) }.count
end
# I was thinking of a counter, but thought that would be even more inefficient
def generate_range
a = 0
b = 0
sample_range_a.each { |val| constant_range.include?(val) ? a += 1 : b += 1 }
end
I don't know if this is entirely your case, but if they're always number ranges with an interval of 1 and not any arbitrary array, the solution can be optimized to O(1), unlike the other methods using to_a that are at least O(n). In other words, if you have a BIG range, those array solutions would choke badly.
Assuming that you'll always use an ascending range of numbers with interval of 1, it means you can count them just by using size (count would be our enemy in this situation).
With that said, using simple math you can first check if the ranges may intersect, if not, just return 0. Otherwise, you can finally calculate the new range interval and get its size.
def range_intersection_count(x, y)
return 0 if x.last < y.first || y.last < x.first
([x.begin, y.begin].max..[x.max, y.max].min).size
end
This will count the number of elements that intersect in two ranges in O(1). You can test this code with something like
range_intersection_count(5000000..10000000000, 3000..1000000000000)
and then try the same input with the other methods and watch your program hang.
The final solution would look something like this:
constant_range = (240..960)
sample_range = (540..1015)
inside_count = range_intersection_count(constant_range, sample_range) # = 421
outside_count = sample_range.size - inside_count # = 55
constant_range = (240..960).to_a
sample_range = (540..1015).to_a
inside_count = (sample_range & constant_range).count #inside: 421
outside_count = sample_range.count - inside_count #outside: 55
You can use - (difference) in Ruby:
constant_range = (240..960).to_a
sample_range = (540..1015).to_a
puts (sample_range - constant_range).count # 55
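If the sample is an arbitrary array rather than a contiguous range, Enumerable#partition splits it into matches and non-matches in a single pass, which is what the question was after. A small sketch using the ranges from the question:

```ruby
constant_range = 240..960
sample = (540..1015).to_a

# partition walks the array once and returns [matches, non_matches];
# cover? only compares against the range's endpoints.
inside, outside = sample.partition { |val| constant_range.cover?(val) }

puts "inside: #{inside.size} | outside: #{outside.size}"
# inside: 421 | outside: 55
```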

Pick item in array by percentage

I have an array which contains names and percentages.
Example:
[["JAMES", 3.318], ["JOHN", 3.271], ["ROBERT", 3.143]].
Now I have about a thousand of these names, and I'm trying to figure out how to choose a name randomly based on the percentage of the name (like how James as 3.318% and John as 3.271%), so that name will have that percentage of being picked (Robert will have a 3.143% of being picked). Help would be appreciated.
You can use max_by: (the docs contain a similar example)
array.max_by { |_, weight| rand ** 1.fdiv(weight) }
This assumes that your weights are actual percentages, i.e. 3.1% has to be expressed as 0.031. Or, if you don't want to adjust your weights:
array.max_by { |_, weight| rand ** 100.fdiv(weight) }
I'm using fdiv here to account for possible integer values. If your weights are always floats, you can also use /.
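To see that this really picks in proportion to the weights, a quick empirical check helps. The sketch below is not part of the answer above: weighted_pick is an invented wrapper, the three weights are made up, and a seeded Random is passed in only to make runs repeatable. Raising every key to the same positive power keeps their order, so rescaling all weights by a common factor should not change which item wins.

```ruby
# Weighted pick: each item gets the key u ** (1 / weight), with u uniform
# in [0, 1); the item with the largest key is returned.
def weighted_pick(pairs, rng = Random.new)
  pairs.max_by do |_, weight|
    u = rng.rand
    u**(1.0 / weight)
  end.first
end

pairs = [["A", 70.0], ["B", 20.0], ["C", 10.0]]  # hypothetical weights
rng = Random.new(42)
counts = Hash.new(0)
20_000.times { counts[weighted_pick(pairs, rng)] += 1 }

# Observed frequencies should be close to 0.70, 0.20 and 0.10.
counts.sort.each { |name, n| puts format("%s %.3f", name, n / 20_000.0) }
```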
Even though I like @Stefan's answer more than mine, I will contribute a possible solution: I would distribute all my percentages along 100.0 so that they start from 0.0 and end at 100.0.
Imagine I have an array with the following percentages:
a = [10.5, 20.5, 17.8, 51.2]
where
a.sum = 100.0
We could write the following to distribute them along 100.0:
sum = 0.0
b = a.map { |el| sum += el }
and the result would be
b = [10.5, 31.0, 48.8, 100.0]
now I can generate a random number between 0.0 and 100.0:
r = rand(0.0..100.0) # or r = rand * 100.0
Imagine r is 45.32.
I select the first element of b that is >= r:
idx = b.index { |el| el >= r }
which in our case would return 2.
Now you can select a[idx].
But I would go with @Stefan's answer as well :)
I assume you will be drawing multiple random values, in which case efficiency is important. Moreover, I assume that all names are unique and all percentages are positive (i.e., that pairs with percentages of 0.0 have been removed).
You are given what amounts to a discrete probability distribution (a probability mass function, the discrete counterpart of a PDF). The first step is to convert that to a cumulative distribution function (CDF).
Suppose we are given the following array (whose percentages sum to 100).
arr = [["LOIS", 28.16], ["JAMES", 22.11], ["JOHN", 32.71], ["ROBERT", 17.02]]
First, separate the names from the percentages.
names, probs = arr.transpose
#=> [["LOIS", "JAMES", "JOHN", "ROBERT"],
# [28.16, 22.11, 32.71, 17.02]]
Next compute the CDF.
cdf = probs.drop(1).
each_with_object([0.01 * probs.first]) { |pdf, cdf|
cdf << 0.01 * pdf + cdf.last }
#=> [0.2816, 0.5027, 0.8298, 1.0]
The idea is that we will generate a (pseudo) random number r between zero and one and find the first value c of the CDF for which r <= c [1]. To do this in an efficient way we will perform an intelligent search of the CDF. This is possible because the CDF is an increasing function.
I will do a binary search, using Array#bsearch_index. This method is essentially the same as Array#bsearch (whose doc is the relevant one), except that it returns the index of the matching element of cdf rather than the element itself. It will shortly be evident why we want the index.
r = rand
#=> 0.6257547400776025
idx = cdf.bsearch_index { |c| r <= c }
#=> 2
Note that we cannot write cdf.bsearch_index { |c| rand <= c } as rand would be executed each time the block is evaluated.
The randomly-selected name is therefore [2]
names[idx]
#=> "JOHN"
Now let's put all this together.
def setup(arr)
  @names, probs = arr.transpose
  @cdf = probs.drop(1).
    each_with_object([0.01 * probs.first]) { |pdf, cdf| cdf << 0.01 * pdf + cdf.last }
end
def random_name
  r = rand
  @names[@cdf.bsearch_index { |c| r <= c }]
end
Let's try it. Execute setup to compute the instance variables @names and @cdf.
setup(arr)
@names
#=> ["LOIS", "JAMES", "JOHN", "ROBERT"]
@cdf
#=> [0.2816, 0.5027, 0.8298, 1.0]
and then call random_name each time a random name is wanted.
5.times.map { random_name }
#=> ["JOHN", "LOIS", "JAMES", "LOIS", "JAMES"]
1. This is how most discrete random variates are generated in simulation models.
2. Had I used bsearch rather than bsearch_index I would have had to earlier create a hash with cdf=>name key-value pairs in order to retrieve a name for a given randomly-selected CDF value.
This is my solution to the problem:
array = [["name1", 33],["name2", 20],["name3",10],["name4",7],["name5", 30]]
def random_name(array)
random_number = rand(0.000..100.000)
sum = 0
array.each do |x|
if random_number.between?(sum, sum + x[1])
return x[0]
else
sum += x[1]
end
end
end
puts random_name(array)

How to get a random number with a given discrete distribution in Ruby

I'm coding my university assignment that is somewhat connected with distributions and random roll stuff. So the question is: how to get a random number with a given discrete distribution in Ruby.
To be more specific: in a trivial example with a simple two-point discrete distribution like (0 with P=1/2; 1000 with P=1/2) I could write a function like this:
def chooseNumber(a,b)
rval = Random.rand(0..1)
return a if rval == 0
return b
end
First question: is there any way to write it using native Random class?
Second question: what is the best way to deal with distributions like (0 with P=1/5; 2 with P=2/5; 1000 with P=2/5) or even worse (0 with P=0.33; 2 with P=0.49; 1000 with P=0.18)?
I would go with something like this
def pick_with_distribution(distributions)
r = rand
distributions.detect{ |k, d| r -= d; r < 0 }.first
end
distributions = { 0 => 0.33, 2 => 0.49, 1000 => 0.18 }
pick_with_distribution(distributions)
#=> 0
To check if distribution is correct, I run it 10000 times, here is the result:
10000.times.inject({}) do |h, _|
r = pick_with_distribution(distributions)
h[r] ||= 0
h[r] += 1
h
end
#=> {0=>3231, 1000=>1860, 2=>4909}
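On the first question: for equally likely outcomes, Array#sample already does the job, and its random: keyword accepts a Random instance if you want to control the generator. A minimal sketch follows; choose_number and choose_from_fifths are illustrative names. Repeating entries is a simple way to encode small rational probabilities such as 1/5, 2/5, 2/5, though it does not extend well to values like 0.33.

```ruby
# Equal-probability pick between two values; `random:` accepts a generator
# such as a seeded Random for reproducible runs.
def choose_number(a, b, rng = Random.new)
  [a, b].sample(random: rng)
end

# Small rational probabilities encoded by repetition:
# 0 with P=1/5, 2 with P=2/5, 1000 with P=2/5.
def choose_from_fifths(rng = Random.new)
  [0, 2, 2, 1000, 1000].sample(random: rng)
end

p choose_number(0, 1000)
p choose_from_fifths
```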

Is there a way to find out where a number lies in a range in ruby?

Let's say I have a min and a max number. max can be anything, but min will always be greater than zero.
I can get the range min..max and let's say I have a third number, count -- I want to divide the range by 10 (or some other number) to get a new scale. So, if the range is 1000, it would increment in values of 100, 200, 300, and find out where the count lies within the range, based on my new scale. So, if count is 235, it would return 2 because that's where it lies on the range scale.
Am I making any sense? I'm trying to create a heat map based on a range of values, basically ... so I need to create the scale based on the range and find out where the value I'm testing lies on that new scale.
I was working with something like this, but it didn't do it:
def heat_map(project, word_count, division)
unless word_count == 0
max = project.words.maximum('quantity')
min = project.words.minimum('quantity')
range = min..max
total = range.count
break_point = total / division
heat_index = total.to_f / word_count.to_f
heat_index.round
else
"freezing"
end
end
I figured there's probably an easier ruby way I'm missing.
Why not just use arithmetic and rounding? Assuming that number is between min and max and you want the range split into n_div divisions and x is the number you want to find the index of (according to above it looks like min = 0, max = 1000, n_div = 10, and x = 235):
def heat_index(x, min, max, n_div)
break_point = (max - min).to_f/n_div.to_f
heat_index = (((x - min).to_f)/break_point).to_i
end
Then heat_index(235, 0, 1000, 10) gives 2.
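One edge case worth noting: with the arithmetic above, x equal to max yields n_div, one past the last bucket. The variant below is a sketch of how that could be handled, assuming you want buckets numbered 0 to n_div - 1; heat_index_clamped is a hypothetical name.

```ruby
# Same arithmetic, but clamped so that x == max lands in the last bucket and
# out-of-range values cannot produce an invalid index.
def heat_index_clamped(x, min, max, n_div)
  return 0 if max == min                 # degenerate range: a single bucket
  width = (max - min).to_f / n_div
  ((x - min) / width).floor.clamp(0, n_div - 1)
end

p heat_index_clamped(235, 0, 1000, 10)   # 2
p heat_index_clamped(1000, 0, 1000, 10)  # 9
```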
I'm just quickly brainstorming an idea, but would something like this help?
>> (1..100).each_slice(10).to_a.index { |subrange| subrange.include? 34 }
=> 3
>> (1..100).each_slice(5).to_a.index { |subrange| subrange.include? 34 }
=> 6
This tells you in which subrange (the subrange size is determined by the argument to each_slice) the value (the argument to subrange.include?) lies.
>> (1..1000).each_slice(100).to_a.index { |subrange| subrange.include? 235 }
=> 2
Note that the indices for the subranges start from 0, so you may want to add 1 to them depending on what you need. Also this isn't ready as is, but should be easy to wrap up in a method.
How's this? It makes an array of range boundaries and then checks if the number lies between them.
def find_range(min, max, query, increment)
values = []
(min..max).step(increment) { |value| values << value }
values.each_with_index do |value, index|
break if values[index + 1].nil?
if query >= value && query < values[index + 1] # >= so a query sitting exactly on a boundary still matches
return index
end
end
end
EDIT: removed redundant variable
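The boundary idea can also be written more compactly with each_cons, which yields each pair of neighbouring boundaries. This is a sketch rather than a drop-in replacement; find_range_index is an invented name, and each interval is treated as closed on the left and open on the right, so a query equal to the last boundary returns nil.

```ruby
# Build the boundaries, then find the first [low, high) pair containing query.
def find_range_index(min, max, query, increment)
  boundaries = (min..max).step(increment).to_a
  boundaries.each_cons(2).find_index { |(low, high)| query >= low && query < high }
end

p find_range_index(0, 1000, 235, 100)  # 2
```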
