What is the probability of the survival of a tribble?

You have a population of k Tribbles. This particular species of Tribbles live for exactly one day and then die. Just before death, a single Tribble has the probability P_i of giving birth to i more Tribbles. What is the probability that after m generations, every Tribble will be dead?
Is my analysis right? If it is, why doesn't it match the expected output?
Case 1:
Number of tribbles: k = 1
Number of generations: m = 1
Probabilities: P_0 = 0.33 P_1 = 0.34 P_2 = 0.33
The probability that after 1 generation every Tribble would be dead = P_0 = 0.33
Case 2:
Number of tribbles: k = 1
Number of generations: m = 2
Probabilities: P_0 = 0.33 P_1 = 0.34 P_2 = 0.33
Each tribble can have either 0 or 1 or 2 children.
At the end of the first generation there has to be at least one tribble left to ensure that there are tribbles in the second generation.
The tribble of the first generation should have 1 or 2 children. So the number of tribbles at the end of the first generation would be either 1 or 2, with probabilities P_1 = 0.34 and P_2 = 0.33 respectively.
If there is to be no children after the second generation, none of these children should have children of their own.
If there is 1 child in the second generation, the probability it would have no children is P_0=0.33
If there are 2 children in the second generation, the probability that none of them would have children is (P_0)^2=(0.33)^2=0.1089
The probability that after 2 generations every tribble would be dead is the probability of there being 1 child times the probability of it not having children, plus the probability of there being 2 children times the probability of neither of them having children = 0.34 × 0.33 + 0.33 × 0.1089 = 0.148137

You missed the case where the first-generation tribble has 0 children.
The correct equation is
P_0 × 1 + P_1 × P_0 + P_2 × P_0^2
= 0.33 + 0.34 × 0.33 + 0.33 × (0.33)^2
= 0.478137
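More generally, let q(m) be the probability that the line of a single tribble is extinct after m generations, with q(0) = 0 and q(m) = sum over i of P_i * q(m-1)^i; since the k starting tribbles reproduce independently, the answer is q(m)^k. A minimal Python sketch of this recurrence (the function name is just for illustration):
def extinction_probability(k, m, p):
    # p[i] = probability that a single tribble has i children
    q = 0.0                                   # probability that one line is dead after 0 generations
    for _ in range(m):
        q = sum(p_i * q ** i for i, p_i in enumerate(p))
    return q ** k                             # the k initial tribbles are independent

print(extinction_probability(1, 1, [0.33, 0.34, 0.33]))   # 0.33
print(extinction_probability(1, 2, [0.33, 0.34, 0.33]))   # ~0.478137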

How to sort different related units?

I've been given a task in which the user enters some unit relations and we have to sort the units from largest to smallest.
What is the best algorithm to do that?
I put some input/output pairs to clarify the problem:
Input:
km = 1000 m
m = 100 cm
cm = 10 mm
Output:
1km = 1000m = 100000cm = 1000000mm
Input:
km = 100000 cm
km = 1000000 mm
m = 1000 mm
Output:
1km = 1000m = 100000cm = 1000000mm
Input:
B = 8 b
MiB = 1024 KiB
KiB = 1024 B
Mib = 1048576 b
Mib = 1024 Kib
Output:
1MiB = 8Mib = 1024KiB = 8192Kib = 1048576B = 8388608b
Input:
B = 8 b
MiB = 1048576 B
MiB = 1024 KiB
MiB = 8192 Kib
MiB = 8 Mib
Output:
1MiB = 8Mib = 1024KiB = 8192Kib = 1048576B = 8388608b
How can I generate the output from the given input?
My attempt at a graph-based solution. Example 3 is the most interesting, so I'll take that one (multiple steps and multiple sinks).
1. Transform B = n A into an edge A -> B labelled n, with n > 1. If the result is not a connected DAG, the input is inconsistent.
2. Reduce towards a bipartite graph by collapsing paths I -> J -> K into I -> K, multiplying the label of I -> J by that of J -> K. Any conflicting labels are a sign that the input is inconsistent.
3. The idea of this step is to end up with a single greatest unit. A left-side vertex P with out-degree greater than 1, with { Q, R } in the right set, P -> Q labelled n1 and P -> R labelled n2, 1 < n1 < n2 (WLOG), can be transformed into P -> R (unchanged) and Q -> R labelled n2 / n1 (bringing Q, in this case Mib, from right to left).
4. Is the graph bipartite with a single right node? If not, go back to step 2.
5. Sort the edges by label.
6. X -> Z labelled n1, ..., Y -> Z labelled n2 becomes 1 Z = n1 X = ... = n2 Y.
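A minimal Python sketch of the same underlying idea, using a plain graph of conversion ratios and propagating sizes from an arbitrary start unit rather than the exact bipartite reduction above; it assumes the input is consistent and connected, and the function name is just for illustration:
from collections import defaultdict
from fractions import Fraction

def sort_units(relations):
    # relations: list of (big, factor, small), meaning 1 big = factor small
    graph = defaultdict(list)
    for big, factor, small in relations:
        graph[big].append((small, Fraction(factor)))     # size(big) = factor * size(small)
        graph[small].append((big, Fraction(1, factor)))
    # assign every unit a size relative to an arbitrary start unit (DFS)
    start = relations[0][0]
    size = {start: Fraction(1)}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, f in graph[u]:
            if v not in size:
                size[v] = size[u] / f                    # follows from the edge definition
                stack.append(v)
    ordered = sorted(size, key=size.get, reverse=True)   # biggest unit first
    biggest = ordered[0]
    return " = ".join(f"{size[biggest] / size[u]}{u}" for u in ordered)

print(sort_units([("km", 1000, "m"), ("m", 100, "cm"), ("cm", 10, "mm")]))
# 1km = 1000m = 100000cm = 1000000mm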
You can use the following algorithm:
1. Detect all existing units: `n` units.
2. Create an `n x n` matrix `M` whose rows and columns correspond to the units, in the same order. Set all elements of the main diagonal to `1`.
3. Put each value specified in the input into the corresponding row and column.
4. Put zero in the transposed position of each entry set in step 3.
5. Put `-1` for all other (unknown) elements.
Now, based on `M` you can easily find the biggest unit:
6. candidate_max <-- columns with only one non-zero positive element
   not_max <-- []
   while len(candidate_max) > 1:
       a. take a pair <i, l> from candidate_max and find a column h such that
          both (i, h) and (l, h) are known, i.e. strictly positive.
          If M[i, h] > M[l, h]:
              remove_item <-- l
          Else:
              remove_item <-- i
          candidate_max.remove(remove_item)
          not_max.append(remove_item)
       b. if no such pair can be found, take a pair <i, l> from candidate_max
          and a column h from not_max with the same property.
          If M[i, h] < M[l, h]:
              candidate_max.remove(i)
              not_max.append(i)
   biggest_unit <-- the only element left in candidate_max
By finding the biggest unit, you can order others based on their value in the corresponding row of the biggest_unit.
7. while there is a `-1` value in row `biggest_unit`, say at column `j`,
   i.e. at `(biggest_unit, j)`:
   a. find a non-identity, strictly positive element at `(k, j)` or `(j, k)`
      such that `(biggest_unit, k)` is also strictly positive and
      non-identity. Then calculate the missing value from the found
      equivalences.
   b. if there is no such element, continue the loop with another `-1`
      element.
8. Sort the units based on their value in the `biggest_unit` row, in
   ascending order.
However, the time complexity of the algorithm is Theta(n^2), where n is the number of units (if you implement the loop in step 6 wisely!).
Example
Input 1
km = 1000 m
m = 100 cm
cm = 10 mm
Solution:
        km     m    cm    mm
km       1  1000    -1    -1
m        0     1   100    -1
cm      -1     0     1    10
mm      -1    -1     0     1
M = [  1  1000    -1    -1
       0     1   100    -1
      -1     0     1    10
      -1    -1     0     1 ]
===> 6. `biggest_unit` <--- km (column 1)
7.1 Find the first `-1` in the first row, at column 3: (1,3).
    Use the strictly positive value at (2,3), since (1,2) is strictly
    positive and non-identity. So the missing value of `(1,3)` must be
    `1000 * 100 = 100000`.
7.2 Find the second `-1` in the first row, at column 4: (1,4).
    Use the strictly positive value at (3,4), since (1,3) is now strictly
    positive and non-identity. So the missing value of `(1,4)` must be
    `100000 * 10 = 1000000`.
The loop is finished here and we have:
M = [  1  1000  100000  1000000
       0     1     100       -1
      -1     0       1       10
      -1    -1       0        1 ]
Now you can sort the elements of the first row in ascending order.
Input 2
km = 100000 cm
km = 1000000 mm
m = 1000 mm
Solution:
        km     m      cm       mm
km       1    -1  100000  1000000
m       -1     1      -1     1000
cm       0    -1       1       -1
mm       0     0      -1        1
M = [  1    -1  100000  1000000
      -1     1      -1     1000
       0    -1       1       -1
       0     0      -1        1 ]
===>
6.1 candidate_max = [1, 2]
6.2 Compare them on column 4 and remove 2
biggest_unit <-- column 1
And going forward with step 7:
Find the first `-1` in the first row, at column 2: (1,2).
Use the strictly positive, non-identity value at (2,4) = 1000, together with the known value (1,4) = 1000000.
So the missing value of `(1,2)` must be `1000000 / 1000 = 1000`.
In sum, we have:
M = [  1  1000  100000  1000000
      -1     1      -1     1000
       0    -1       1       -1
       0     0      -1        1 ]
Now you can sort the elements of the first row in ascending order (step 8).
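For a rough Python sketch of the matrix idea, the unknown entries can also be filled by composing known ratios with a Floyd-Warshall-style closure (a simpler substitute for the candidate elimination of step 6 and the fill-in of step 7); the names are just for illustration, and a consistent, connected input is assumed:
def sort_units_matrix(relations):
    # relations: list of (big, factor, small), meaning 1 big = factor small
    units = sorted({u for big, _, small in relations for u in (big, small)})
    idx = {u: i for i, u in enumerate(units)}
    n = len(units)
    M = [[None] * n for _ in range(n)]                   # None plays the role of -1
    for i in range(n):
        M[i][i] = 1
    for big, factor, small in relations:
        M[idx[big]][idx[small]] = factor                 # 1 big = factor small
        M[idx[small]][idx[big]] = 1 / factor
    # closure: if 1 i = M[i][k] k and 1 k = M[k][j] j, then 1 i = M[i][k] * M[k][j] j
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if M[i][j] is None and M[i][k] is not None and M[k][j] is not None:
                    M[i][j] = M[i][k] * M[k][j]
    # the biggest unit has the largest conversion factors in its row
    biggest = max(range(n), key=lambda i: max(v for v in M[i] if v is not None))
    order = sorted(range(n), key=lambda j: M[biggest][j])
    return " = ".join(f"{round(M[biggest][j])}{units[j]}" for j in order)

print(sort_units_matrix([("km", 100000, "cm"), ("km", 1000000, "mm"), ("m", 1000, "mm")]))
# 1km = 1000m = 100000cm = 1000000mm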

A variant of the Knapsack algorithm

I have a list of items, a, b, c,..., each of which has a weight and a value.
The 'ordinary' Knapsack algorithm will find the selection of items that maximises the value of the selected items, whilst ensuring that the weight is below a given constraint.
The problem I have is slightly different. I wish to minimise the value (easy enough by using the reciprocal of the value), whilst ensuring that the weight is at least the value of the given constraint, not less than or equal to the constraint.
I have tried re-routing the idea through the ordinary Knapsack algorithm, but this can't be done. I was hoping there is another combinatorial algorithm that I am not aware of that does this.
In the German wiki it's formalized as:
U: a finite set of objects
w: U -> R   (weight function)
v: U -> R   (value function)
B in R      # constraint rhs
Find a subset K of U subject to:
    sum over u in K of w(u) <= B
such that:
    sum over u in K of v(u) is maximized
So there is no restriction like nonnegativity.
Just use negative weights, negative values and a negative B.
The basic concept is:
    sum over u in K of w(u) <= B
<->
    sum over u in K of -w(u) >= -B
So in your case:
classic constraint:  x0 + x1 <= B   |  3 + 7 <= 12   Y  |  3 + 10 <= 12   N
becomes:            -x0 - x1 <= -B  | -3 - 7 <= -12  N  | -3 - 10 <= -12  Y
So for a given implementation it depends on the software whether this is allowed. In terms of the optimization problem there is no issue: the integer-programming formulation for your case is as natural as the classic one (and bounded).
Python Demo based on Integer-Programming
Code
import numpy as np
import scipy.sparse as sp
from cylp.cy import CyClpSimplex
np.random.seed(1)
""" INSTANCE """
weight = np.random.randint(50, size = 5)
value = np.random.randint(50, size = 5)
capacity = 50
""" SOLVE """
n = weight.shape[0]
model = CyClpSimplex()
x = model.addVariable('x', n, isInt=True)
model.objective = value # MODIFICATION: default = minimize!
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int) # assumes existence
print("INSTANCE")
print(" weights: ", weight)
print(" values: ", value)
print(" capacity: ", capacity)
print("Solution")
print(x_sol)
print("sum weight: ", x_sol.dot(weight))
print("value: ", x_sol.dot(value))
Small remarks
This code is just a demo using a somewhat low-level library; there are other tools available which might be better suited (e.g. on Windows: pulp)
it's the classic integer-programming formulation from the wiki, modified as mentioned above
it will scale very well as the underlying solver is pretty good
as written, it's solving the 0-1 knapsack (only variable bounds would need to be changed)
Small look at the core-code:
# create model
model = CyClpSimplex()
# create one variable for each how-often-do-i-pick-this-item decision
# variable needs to be integer (or binary for 0-1 knapsack)
x = model.addVariable('x', n, isInt=True)
# the objective value of our IP: a linear-function
# cylp only needs the coefficients of this function: c0*x0 + c1*x1 + c2*x2...
# we only need our value vector
model.objective = value # MODIFICATION: default = minimize!
# WARNING: typically one should always use variable-bounds
# (cylp problems...)
# workaround: express bounds lower_bound <= var <= upper_bound as two constraints
# a constraint is an affine-expression
# sp.eye creates a sparse-diagonal with 1's
# example: sp.eye(3) * x >= 5
# 1 0 0 -> 1 * x0 + 0 * x1 + 0 * x2 >= 5
# 0 1 0 -> 0 * x0 + 1 * x1 + 0 * x2 >= 5
# 0 0 1 -> 0 * x0 + 0 * x1 + 1 * x2 >= 5
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
# cylp somewhat outdated: need numpy's matrix class
# apart from that it's just the weight-constraint as defined at wiki
# same affine-expression as above (but only a row-vector-like matrix)
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
# internal conversion of type needed to treat it as an IP (or else it would be an LP)
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
# type-casting
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int)
Output
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Jan 15 2018
command line - ICbcModel -solve -quit (default strategy 1)
Continuous objective value is 4.88372 - 0.00 seconds
Cgl0004I processed model has 1 rows, 4 columns (4 integer (4 of which binary)) and 4 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of 5
Cbc0038I Before mini branch and bound, 4 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.00 seconds)
Cbc0038I After 0.00 seconds - Feasibility pump exiting with objective of 5 - took 0.00 seconds
Cbc0012I Integer solution of 5 found by feasibility pump after 0 iterations and 0 nodes (0.00 seconds)
Cbc0001I Search completed - best objective 5, took 0 iterations and 0 nodes (0.00 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from 5 to 5
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: 5.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.00
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
INSTANCE
weights: [37 43 12 8 9]
values: [11 5 15 0 16]
capacity: 50
Solution
[0 1 0 1 0]
sum weight: 51
value: 5
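If you'd rather not pull in an IP solver, the same 0-1 variant can also be handled with the usual knapsack DP by clamping the accumulated weight at the capacity (any total weight >= capacity is equally acceptable). A minimal sketch, assuming non-negative integer weights and non-negative values; the function name is just for illustration:
def min_value_cover(weights, values, capacity):
    # dp[w] = minimum total value of a subset whose (clamped) total weight is w;
    # weights above `capacity` are clamped to `capacity`, since "at least capacity"
    # is all we care about
    INF = float("inf")
    dp = [INF] * (capacity + 1)
    dp[0] = 0
    for wt, val in zip(weights, values):
        for w in range(capacity, -1, -1):     # descending: each item used at most once
            if dp[w] < INF:
                nw = min(capacity, w + wt)
                dp[nw] = min(dp[nw], dp[w] + val)
    return dp[capacity]                       # INF means no subset reaches the capacity

# the instance from the demo above: picks the items with weights 43 and 8
print(min_value_cover([37, 43, 12, 8, 9], [11, 5, 15, 0, 16], 50))   # 5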

Fitness Proportionate Selection when some fitnesses are 0

I have a question about what to do with fitnesses that are 0 when computing the fitness-proportionate probabilities. Should the container for the members be sorted by highest fitness first, and then should I run code similar to this:
for all members of population
    sum += fitness of this individual
end for
for all members of population
    probability = sum of probabilities + (fitness / sum)
    sum of probabilities += probability
end for
loop until new population is full
    do this twice
        number = Random between 0 and 1
        for all members of population
            if number > probability but less than next probability
                then you have been selected
        end for
    end
    create offspring
end loop
The problem I see as I go through one iteration by hand with randomly generated members is that some members have a fitness of 0, and when computing the probabilities of those members, they keep the same cumulative probability as the last non-zero member. Is there a way I can separate the non-zero probabilities from the zero probabilities? I was thinking that even if I sort by highest fitness, the last non-zero member would still have the same cumulative probability as the zero-fitness members.
Consider this example:
individual   fitness(i)   probability(i)   partial_sum(i)
    1            10       10/20 = 0.50     0.50
    2             3        3/20 = 0.15     0.50 + 0.15 = 0.65
    3             2        2/20 = 0.10     0.50 + 0.15 + 0.10 = 0.75
    4             0        0/20 = 0.00     0.50 + 0.15 + 0.10 + 0.00 = 0.75
    5             5        5/20 = 0.25     0.50 + 0.15 + 0.10 + 0.00 + 0.25 = 1.00
                 ----
    Sum          20
Now, if number = Random in [0, 1), we are going to pick individual i if:
individual condition
1 0.00 <= number < partial_sum(1) = 0.50
2 0.50 = partial_sum(1) <= number < partial_sum(2) = 0.65
3 0.65 = partial_sum(2) <= number < partial_sum(3) = 0.75
4 0.75 = partial_sum(3) <= number < partial_sum(4) = 0.75
5 0.75 = partial_sum(4) <= number < partial_sum(5) = 1.00
If an individual has fitness 0 (e.g. I4) it cannot be selected because of its selection condition (e.g. I4 has the associated condition 0.75 <= number < 0.75).
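A minimal Python sketch of this partial-sum (roulette-wheel) selection; the function name is just for illustration. A zero-fitness individual adds nothing to the running partial sum, so the strict comparison can never land on it:
import random

def roulette_select(population, fitnesses):
    # pick one individual with probability proportional to its fitness
    total = sum(fitnesses)
    r = random.uniform(0, total)              # equivalent to number * total for number in [0, 1)
    partial = 0.0
    for individual, fit in zip(population, fitnesses):
        partial += fit
        if r < partial:                       # zero-width slices can never satisfy this
            return individual
    return population[-1]                     # guard against floating-point round-off

# the example above: individual 4 (fitness 0) is never returned
pop, fits = [1, 2, 3, 4, 5], [10, 3, 2, 0, 5]
picks = [roulette_select(pop, fits) for _ in range(10000)]
print({i: picks.count(i) for i in pop})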

How to implement a cumulative product table?

Given the following problem:
There is a sequence of k integers, named s, on which 2 operations can be performed:
1) Sum[i,j] -
What is the value of s[i]+s[i+1]+...+s[j]?
2) Update[i,val] -
Change the value of s[i] to val.
I am sure most people here have heard of using a cumulative frequency table / Fenwick tree to optimize the complexity.
Now, if I don't want to query the sum but instead I want to perform the following:
Product[i,j] -
What is the value of s[i] * s[i+1] * ... * s[j]?
The new problem seems trivial at first, at least for the first operation Product[i,j].
Assuming I am using a cumulative product table named f:
At first thought, when we call Update[i,val], we should divide the cumulative products f[z], for z from i onwards, by the old value of s[i] and then multiply by the new value.
But we will face 2 issues if the old value of s[i] is 0:
Division by 0. This is easily tackled by checking whether the old value of s[i] is 0.
The product of any real number with 0 is 0. The old zero will have forced all values from f[i] onwards to be 0, so we are unable to successfully perform Update[i,val]. This problem is not so trivial, as it affects other values besides f[i].
Does anyone have any ideas on how I could implement a cumulative product table that supports the 2 operations mentioned above?
Maintain 2 tables:
A cumulative product table, in which all zero entries are stored as ones instead (to avoid wiping out other entries).
A cumulative sum counting the zero entries: the entry for position i is 1 if the factor at i is 0, and 0 otherwise.
To compute a range product, first calculate the cumulative sum of zero entries over the given range. If it is non-zero (i.e. there is 1 or more zero in the range), the product is zero. If it is zero, calculate the product as you describe.
It might also be convenient to store your factors as logarithms in some base and compute the range product as a sum of log values, so you'd just be computing 2 cumulative sums. In that case the zero entries, stored as 1 in the product table, correspond to log values of 0.
Here's an example, using a simple cumulative sum (not Fenwick trees, but you could easily use them instead):
row    f   cum_f   isZero   cum_isZero   log(f)   cum_log(f)
 -1    1     1       0          0          0         0
  0    3     3       0          0          0.477     0.477
  1    0     3       1          1         -inf       0.477
  2    4    12       0          1          0.602     1.079
  3    2    24       0          1          0.301     1.38
  4    3    72       0          1          0.477     1.857
row is the index, f is the factor, cum_f is the cumulative product of f treating zeros as if they were ones, isZero is a flag to indicate if f is zero, cum_isZero is the cumulative sum of the isZero flags, log(f) is the log of f in base 10, cum_log(f) is the cumulative sum of log_f, treating -inf as zero.
To calculate the sum or product of a range from row i to row j (inclusive), subtract row[i-1] from row[j], using row -1 as a "virtual" row.
To calculate the cumulative product of f in rows 0-2, first find the cumulative sum of isZero: cum_isZero[2] - cum_isZero[-1] = 1 - 0 = 1. That's non-zero, so the cumulative product is 0.
To calculate the cumulative product of f in rows 2-4, do as above: cum_isZero[4] - cum_isZero[1] = 0 - 0 = 0. That's zero, so we can calculate the product.
Using cum_f: cum_f[4] / cum_f[1] = 72 / 3 = 24 = 4 x 2 x 3
Using cum_log_f: cum_log(f)[4] - cum_log(f)[1] = 1.857 - 0.477 = 1.38
10^1.38 ≈ 24
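A minimal Python sketch of the two-table idea using Fenwick trees, so both Update and Product run in O(log n): zero factors are stored as 1 in the product tree and counted in a separate sum tree. Class and method names are just for illustration, and floats are used for brevity (exact arithmetic such as Fraction would avoid precision issues):
class Fenwick:
    # point-update / prefix-query tree for a commutative, invertible operation
    def __init__(self, n, identity, op):
        self.n, self.identity, self.op = n, identity, op
        self.tree = [identity] * (n + 1)
    def update(self, i, delta):               # combine `delta` into position i (1-based)
        while i <= self.n:
            self.tree[i] = self.op(self.tree[i], delta)
            i += i & -i
    def prefix(self, i):                      # combined value of positions 1..i
        acc = self.identity
        while i > 0:
            acc = self.op(acc, self.tree[i])
            i -= i & -i
        return acc

class ProductTable:
    def __init__(self, s):
        self.n, self.s = len(s), list(s)
        self.prod = Fenwick(self.n, 1.0, lambda a, b: a * b)   # zeros stored as 1
        self.zeros = Fenwick(self.n, 0, lambda a, b: a + b)    # count of zero entries
        for i, v in enumerate(s, start=1):
            self.prod.update(i, v if v != 0 else 1.0)
            self.zeros.update(i, 1 if v == 0 else 0)
    def update(self, i, val):                 # Update[i, val], 1-based index
        old, self.s[i - 1] = self.s[i - 1], val
        old_nz = old if old != 0 else 1.0
        new_nz = val if val != 0 else 1.0
        self.prod.update(i, new_nz / old_nz)                   # multiplicative "delta"
        self.zeros.update(i, (1 if val == 0 else 0) - (1 if old == 0 else 0))
    def product(self, i, j):                  # Product[i, j], 1-based inclusive
        if self.zeros.prefix(j) - self.zeros.prefix(i - 1) > 0:
            return 0
        return self.prod.prefix(j) / self.prod.prefix(i - 1)

pt = ProductTable([3, 0, 4, 2, 3])            # the factors from rows 0-4 above
print(pt.product(1, 3))                       # rows 0-2 -> 0 (a zero in the range)
print(pt.product(3, 5))                       # rows 2-4 -> 24.0
pt.update(2, 5)                               # replace the zero with 5
print(pt.product(1, 3))                       # now 3 * 5 * 4 = 60.0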

What are some good ways to calculate a score for how difference or close 2 users choices are?

For example, if it is the choice of chocolate, ice cream, donut, ..., for the order of their preference.
If user 1 choose
A B C D E F G H I J
and user 2 chooses
J A B C I G F E D H
what are some good ways to calculate a score from 0 to 100 that tells how close their choices are? It has to make sense: if most answers are the same and just 1 or 2 differ, the score should not come out extremely low. And if most answers are just "shifted by 1 position", we cannot count them as "all different" and give a 0 score for those differences of only 1 position.
Assign each letter item an integer value starting at 1:
A=1, B=2, C=3, D=4, E=5, F=6 (stopping at F for simplicity)
Then consider the position each item is placed in, and use this as a multiplier.
So if an item is in the first position, its multiplier is 1; if it's in the 6th position, the multiplier is 6.
Figure out the maximum score you could have (basically when everything is in consecutive order):
item a b c d e f
order 1 2 3 4 5 6
value 1 2 3 4 5 6
score 1 4 9 16 25 36 Sum = 91, Score = 100% (MAX)
item a b d c e f
order 1 2 3 4 5 6
value 1 2 4 3 5 6
score 1 4 12 12 25 36 Sum = 90 Score = 99%
=======================
order 1 2 3 4 5 6
item f d b c e a
value 6 4 2 3 5 1
score 6 8 6 12 25 6 Sum = 63 Score = 69%
order 1 2 3 4 5 6
item d f b c e a
value 4 6 2 3 5 1
score 4 12 6 12 25 6 Sum = 65 Score = 71%
Obviously this is a very crude implementation that I just came up with, and it may not work for everything. Examples 3 and 4 differ by a single swap, yet the score is off by 2% (versus examples 1 and 2, which are off by 1%). It's just a thought; I'm no algorithm expert. You could probably take the final number and do something else to it for a better numerical comparison.
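A minimal Python sketch of this position-times-value scoring, taking one user's ranking as the reference value assignment; `order_score` is a hypothetical helper, both orderings are assumed to contain the same items, and the normalisation is against the perfect-agreement sum as above:
def order_score(reference, candidate):
    # value of each item = its 1-based position in the reference ordering
    value = {item: i for i, item in enumerate(reference, start=1)}
    # candidate's score = sum of (position in candidate) * (item value)
    score = sum(pos * value[item] for pos, item in enumerate(candidate, start=1))
    max_score = sum(i * i for i in range(1, len(reference) + 1))   # perfect agreement
    return 100.0 * score / max_score

print(order_score("ABCDEF", "ABDCEF"))   # ~98.9, the 99% example above
print(order_score("ABCDEF", "FDBCEA"))   # ~69.2, the 69% example above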
You could:
Calculate the edit distance between the sequences;
Subtract the edit distance from the sequence length;
Divide that by the length of the sequence;
Multiply it by a hundred.
Score = 100 * (SequenceLength - Levenshtein(Sequence1, Sequence2)) / SequenceLength
Edit distance is basically the number of operations required to transform sequence one into sequence two. One algorithm for this is the Levenshtein distance algorithm.
Examples:
Weights
insert: 1
delete: 1
substitute: 1
Seq 1: ABCDEFGHIJ
Seq 2: JABCIGFEDH
Score = 100 * (10-7) / 10 = 30
Seq 1: ABCDEFGHIJ
Seq 2: ABDCFGHIEJ
Score = 100 * (10-3) / 10 = 70
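A minimal Python sketch of this score, using the classic dynamic-programming Levenshtein distance with unit weights; the function names are just for illustration:
def levenshtein(a, b):
    # classic row-by-row DP edit distance (insert/delete/substitute, all cost 1)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]

def similarity_score(seq1, seq2):
    return 100 * (len(seq1) - levenshtein(seq1, seq2)) / len(seq1)

print(similarity_score("ABCDEFGHIJ", "JABCIGFEDH"))   # 30.0
print(similarity_score("ABCDEFGHIJ", "ABDCFGHIEJ"))   # 70.0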
The most straightforward way to calculate it is the Levenshtein distance, which is the number of changes that must be made to transform one string into another.
A disadvantage of the Levenshtein distance for your task is that it doesn't measure closeness between the products themselves, i.e. you will not know how close A and J are to each other. For example, user 1 may like donuts and user 2 may like buns, and you may know that most people who like the first also like the second. From this information you can infer that user 1 makes choices that are close to those of user 2, even though they don't share the same elements.
If this is your case, you will have to use one of two things: statistical methods to infer the correlation between choices, or a recommendation engine.
