How to generate multiple unrepeated random variables with a fixed non-uniform distribution? - probability

So for a given discrete distribution, say (0.2, 0.4, 0.4), it is easy to generate one random number that follows this distribution.
However, what about generating multiple non-repeating random numbers from it?
E.g. from the distribution (p1 = 0.2, p2 = 0.4, p3 = 0.4), if I generate the pairs
(1,2) with p12 = 0.2,
(2,3) with p23 = 0.6,
(1,3) with p13 = 0.2.
I'm able to have the marginal distribution of
p1 = (p12 + p13)/2 = 0.2,
p2 = (p23 + p12)/2 = 0.4,
p3 = (p13 + p23)/2 = 0.4.
which is the same as the given distribution.
Any idea how to build a generator that accomplishes this for a general distribution? Thanks :)

If you look at the problem as a set of linear equations, you can express it as a matrix equation. For instance:
| 1/2  1/2   0  |   | p12 |   | p1 |
| 1/2   0   1/2 | * | p13 | = | p2 |
|  0   1/2  1/2 |   | p23 |   | p3 |
Now you can invert the matrix to get:
| p12 |   |  1   1  -1 |   | p1 |
| p13 | = |  1  -1   1 | * | p2 |
| p23 |   | -1   1   1 |   | p3 |
In your example this will produce:
| p12 |   |  1   1  -1 |   | 0.2 |   | 0.2 |
| p13 | = |  1  -1   1 | * | 0.4 | = | 0.2 |
| p23 |   | -1   1   1 |   | 0.4 |   | 0.6 |
So, p12 = p13 = 0.2 and p23 = 0.6.
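The inversion above can be sketched in a few lines of Python. This is only an illustration for the three-outcome, pairs-of-two case; the function names `pair_probabilities` and `sample_pair` are made up for this sketch. Note that the inverse yields a negative "probability" whenever some p_i exceeds 0.5 (for instance p23 = 1 - 2*p1), so the sketch rejects such inputs rather than trying to handle them.

```python
import random

def pair_probabilities(p1, p2, p3):
    """Apply the inverted 3x3 matrix to get (p12, p13, p23) from (p1, p2, p3)."""
    probs = {
        (1, 2): p1 + p2 - p3,
        (1, 3): p1 - p2 + p3,
        (2, 3): -p1 + p2 + p3,
    }
    # A negative entry means no pair distribution has these marginals.
    if any(v < -1e-12 for v in probs.values()):
        raise ValueError("no valid pair distribution: some p_i exceeds 0.5")
    return probs

def sample_pair(probs, rng=random):
    """Draw one unordered pair according to the pair probabilities."""
    pairs = list(probs)
    weights = [max(0.0, probs[k]) for k in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]

probs = pair_probabilities(0.2, 0.4, 0.4)
pair = sample_pair(probs, random.Random(0))
```

Each call to `sample_pair` returns two distinct values, and averaging over many draws reproduces the marginals (0.2, 0.4, 0.4).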

Calculating final market distribution - competitive programming

I came across following question while practicing competitive programming. I solved it manually, kinda designing an approach, but my answer is wrong and I cannot imagine how to scale my approach.
Question:
N coffee chains are competing for market share in a fierce advertising battle. Each day a percentage of customers will be convinced to switch from one chain to another.
Current market share and daily probability of customer switching is given. If the advertising runs forever, what will be the final distribution of market share?
Assumptions: Total market share is 1.0, probability that a customer switches is independent of other customers and days.
Example: 2 coffee chains, A and B. Market share of A: 0.4; market share of B: 0.6.
Each day, there is a 0.2 probability that a customer switches from A to B. Each day, there is a 0.1 probability that a customer switches from B to A.
input: market_share = [0.4, 0.6],
switch_prob = [[.8, .2], [.1, .9]]
output: [0.3333 0.6667]
Everything till here is part of a question, I did not form the example or assumptions, they were given with the question.
My attempt: in my understanding, the switch probabilities indicate the probability of switching from A to B.
Hence,
market_share_of_A = current_market_share - lost_customers + gained_customers and
market_share_of_B = (1 - market_share_of_A)
iter_1:
lost_customers = 0.4 * 0.8 * 0.2 = 0.064
gained_customers = 0.6 * 0.2 * 0.1 = 0.012
market_share_of_A = 0.4 - 0.064 + 0.012 = 0.348
market_share_of_B = 1 - 0.348 = 0.652
iter_2:
lost_customers = 0.348 * 0.1 * 0.2 = 0.00696
gained_customers = 0.652 * 0.9 * 0.1 = 0.05868
market_share_of_A = 0.348 - 0.00696 + 0.05868 = 0.39972
market_share_of_B = 1 - 0.39972 = 0.60028
my answer: [0.39972, 0.60028]
As stated earlier, expected answers are [0.3333 0.6667].
I do not understand where I am wrong. If something is wrong, it has to be my understanding of the question. Please provide your thoughts.
In the example, they demonstrated an easy case with only two competitors. What if there are more? Let us say three: A, B, C. I think the input has to provide switch probabilities in the form [[0.1, 0.3, 0.6]..] because A can lose its customers to B as well as C, and there would be many instances of that. Now, I will have to compute the market share of at least two companies; the third one will be (1 - sum_of_all). And while computing B's market share, I will have to compute its lost customers as well as its gained customers, and the formula would be (current - lost + gained). Gained will be the sum of gain_from_A and gain_from_C. Is this correct?
Following on from my comment, this problem can be expressed as a matrix equation.
The elements of the "transition" matrix, T(i, j) (dimensions N x N) are defined as follows:
i = j (diagonal): the probability of a customer staying with chain i
i != j (off-diagonal): the probability of a customer of chain j transferring to chain i
What is the physical meaning of this matrix? Let the market share state be represented by a vector P(i) of size N, whose i-th value is the market share of chain i. The vector P' = T * P is the next share state after each day.
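As a minimal sketch of what "the next share state" means, the following Python applies P' = T * P repeatedly to the two-chain example. The helper names `step` and `iterate` are made up for illustration. Note that T here is the transpose of the question's `switch_prob` input, because T(i, j) is the probability of moving from chain j to chain i.

```python
def step(T, P):
    """One day of switching: returns T * P, with T[i][j] = P(move from chain j to chain i)."""
    n = len(P)
    return [sum(T[i][j] * P[j] for j in range(n)) for i in range(n)]

def iterate(T, P, days):
    """Apply the daily transition `days` times."""
    for _ in range(days):
        P = step(T, P)
    return P

# Two-chain example: each column of T sums to 1.
T = [[0.8, 0.1],
     [0.2, 0.9]]
P0 = [0.4, 0.6]

one_day = step(T, P0)            # A keeps 0.8 of 0.4 and gains 0.1 of 0.6
long_run = iterate(T, P0, 200)   # approaches the equilibrium shares
```

After one day A holds 0.8 * 0.4 + 0.1 * 0.6 = 0.38, and after many days the shares settle near [0.3333, 0.6667], the expected output from the question.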
With that in mind, the equilibrium equation is given by T * P = P, i.e. the final state is invariant under transition T:
| T(1, 1)  T(1, 2)  T(1, 3)  ...  T(1, N) |   | P(1) |   | P(1) |
| T(2, 1)  T(2, 2)           ...          |   | P(2) |   | P(2) |
| T(3, 1)                    ...          |   | P(3) |   | P(3) |
|    .                        .           | * |  .   | = |  .   |
|    .                        .           |   |  .   |   |  .   |
|    .                        .           |   |  .   |   |  .   |
| T(N, 1)                         T(N, N) |   | P(N) |   | P(N) |
However, this is unsolvable by itself: each column of T sums to 1, so T - I is singular and, in the typical case, P can only be determined up to a common scale factor (as MBo suggests, the system is degenerate). There is an additional constraint that the shares add up to 1:
P(1) + P(2) + ... + P(N) = 1
We can choose an arbitrary share value (say, the Nth one) and replace it using this expression. Multiplying out, the first row of the equation is:
T(1, 1) P(1) + T(1, 2) P(2) + ... + T(1, N) (1 - [P(1) + P(2) + ... + P(N - 1)]) = P(1)
--> [T(1, 1) - T(1, N) - 1] P(1) + [T(1, 2) - T(1, N)] P(2) + ... + [T(1, N - 1) - T(1, N)] P(N - 1) = -T(1, N)
The equivalent equation for the second row is:
[T(2, 1) - T(2, N)] P(1) + [T(2, 2) - T(2, N) - 1] P(2) + ... + [T(2, N - 1) - T(2, N)] P(N - 1) = -T(2, N)
To summarize the general pattern, we define:
A matrix S(i, j) (dimensions [N - 1] x [N - 1]):
- S(i, i) = T(i, i) - T(i, N) - 1
- S(i, j) = T(i, j) - T(i, N) (i != j)
A vector Q(i) of size N - 1 containing the first N - 1 elements of P(i)
A vector R(i) of size N - 1, such that R(i) = -T(i, N)
The equation then becomes S * Q = R:
| S(1, 1)    S(1, 2)  S(1, 3)  ...  S(1, N-1)   |   | Q(1)   |   | R(1)   |
| S(2, 1)    S(2, 2)           ...              |   | Q(2)   |   | R(2)   |
| S(3, 1)                      ...              |   | Q(3)   |   | R(3)   |
|    .                          .               | * |   .    | = |   .    |
|    .                          .               |   |   .    |   |   .    |
|    .                          .               |   |   .    |   |   .    |
| S(N-1, 1)                         S(N-1, N-1) |   | Q(N-1) |   | R(N-1) |
Solving the above equation gives Q, which gives the first N - 1 share values (and of course the last one too from the constraint). Methods for doing so include Gaussian elimination and LU decomposition, both of which are more efficient than the naive route of directly computing Q = inv(S) * R.
Note that you can flip the signs in S and R for slightly more convenient evaluation.
The toy example given above turns out to be quite trivial:
| 0.8  0.1 |   | P1 |   | P1 |
|          | * |    | = |    |
| 0.2  0.9 |   | P2 |   | P2 |
--> S = | -0.3 |, R = | -0.1 |
--> Q1 = P1 = -0.1 / -0.3 = 0.3333
P2 = 1 - P1 = 0.6667
An example for N = 3:
T = | 0.1  0.2  0.3 |
    | 0.4  0.7  0.3 |
    | 0.5  0.1  0.4 |
--> S = | -1.2  -0.1 | ,  R = | -0.3 |
        |  0.1  -0.6 |        | -0.3 |
--> Q = | 0.205479 | ,  P3 = 0.260274
        | 0.534247 |
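The S * Q = R construction can be sketched in plain Python as follows. This is an illustrative implementation, not part of the original answer: the function name `stationary_shares` is made up, and it uses a small hand-written Gaussian elimination with partial pivoting so that it needs no external library.

```python
def stationary_shares(T):
    """Solve S * Q = R for the first N-1 shares, then add the last one from the sum constraint.

    T[i][j] is the probability that a customer of chain j moves to chain i
    (so every column of T sums to 1).
    """
    n = len(T)
    m = n - 1
    # Augmented matrix [S | R]: S(i, j) = T(i, j) - T(i, N) - [i == j], R(i) = -T(i, N).
    A = []
    for i in range(m):
        row = [T[i][j] - T[i][n - 1] - (1.0 if i == j else 0.0) for j in range(m)]
        row.append(-T[i][n - 1])
        A.append(row)
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system: equilibrium is not unique")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    Q = [0.0] * m
    for i in range(m - 1, -1, -1):
        rest = sum(A[i][j] * Q[j] for j in range(i + 1, m))
        Q[i] = (A[i][m] - rest) / A[i][i]
    return Q + [1.0 - sum(Q)]

two_chains = stationary_shares([[0.8, 0.1],
                                [0.2, 0.9]])
three_chains = stationary_shares([[0.1, 0.2, 0.3],
                                  [0.4, 0.7, 0.3],
                                  [0.5, 0.1, 0.4]])
```

For the two examples in this answer it reproduces [0.3333, 0.6667] and [0.205479, 0.534247, 0.260274] respectively.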
Please forgive the Robinson Crusoe style formatting - I'll try to write these in LaTeX later for readability.

Why don't we include 0 matches while calculating jaccard distance between binary numbers?

I am working on a program based on Jaccard Distance, and I need to calculate the Jaccard Distance between two binary bit vectors. I came across the following on the net:
If p1 = 10111 and p2 = 10011,
The total number of each combination attributes for p1 and p2:
M11 = total number of attributes where p1 & p2 have a value 1,
M01 = total number of attributes where p1 has a value 0 & p2 has a value 1,
M10 = total number of attributes where p1 has a value 1 & p2 has a value 0,
M00 = total number of attributes where p1 & p2 have a value 0.
Jaccard similarity coefficient = J =
intersection/union = M11/(M01 + M10 + M11)
= 3 / (0 + 1 + 3) = 3/4,
Jaccard distance = J' = 1 - J = 1 - 3/4 = 1/4,
Or J' = 1 - (M11/(M01 + M10 + M11)) = (M01 + M10)/(M01 + M10 + M11)
= (0 + 1)/(0 + 1 + 3) = 1/4
Now, while calculating the coefficient, why was "M00" not included in the denominator? Can anyone please explain?
The Jaccard coefficient is a measure for asymmetric binary attributes, i.e. scenarios where the presence of an item is more important than its absence.
Since M00 deals only with absence, we do not consider it while calculating the Jaccard coefficient.
For example, while checking for the presence/absence of a disease, the presence of the disease is the more significant outcome.
Hope it helps!
The Jaccard index of A and B is |A∩B|/|A∪B| = |A∩B|/(|A| + |B| - |A∩B|).
We have: |A∩B| = M11, |A| = M11 + M10, |B| = M11 + M01.
So |A∩B|/(|A| + |B| - |A∩B|) = M11 / (M11 + M10 + M11 + M01 - M11) = M11 / (M10 + M01 + M11).
[Figure: Venn diagram of the sets A and B]
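A small Python sketch of the counts above, using the bit strings from the question. The function names are made up for illustration, and returning 1.0 when both vectors are all zeros is an assumed convention, not something the source specifies. The last two lines show the practical effect of leaving out M00: padding both vectors with shared zeros does not change the Jaccard similarity.

```python
def match_counts(p1, p2):
    """Return (M11, M10, M01, M00) for two equal-length bit strings."""
    if len(p1) != len(p2):
        raise ValueError("bit vectors must have the same length")
    m11 = sum(a == "1" and b == "1" for a, b in zip(p1, p2))
    m10 = sum(a == "1" and b == "0" for a, b in zip(p1, p2))
    m01 = sum(a == "0" and b == "1" for a, b in zip(p1, p2))
    m00 = sum(a == "0" and b == "0" for a, b in zip(p1, p2))
    return m11, m10, m01, m00

def jaccard_similarity(p1, p2):
    """M11 / (M01 + M10 + M11); M00 is deliberately ignored."""
    m11, m10, m01, _ = match_counts(p1, p2)
    union = m11 + m10 + m01
    if union == 0:
        return 1.0  # assumed convention for two all-zero vectors
    return m11 / union

def jaccard_distance(p1, p2):
    return 1.0 - jaccard_similarity(p1, p2)

similarity = jaccard_similarity("10111", "10011")
distance = jaccard_distance("10111", "10011")
# Appending positions where both vectors are 0 only increases M00.
padded_similarity = jaccard_similarity("10111" + "00000", "10011" + "00000")
```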

Kernel density estimation julia

I am trying to implement a kernel density estimation. However, my code does not give the answer it should. It is written in Julia, but the code should be self-explanatory.
Here is the algorithm:
f(x) = 1 / (n * h) * sum_{i=1}^{n} K((x - X_i) / h)
where
K(u) = 0.5 if |u| <= 1, and 0 otherwise.
So the algorithm tests whether the distance between x and an observation X_i, divided by a constant (the binwidth h), is less than one. If so, it adds 0.5 / (n * h) to the estimate at x, where n = number of observations.
Here is my implementation:
#Kernel density function.
#Purpose: estimate the probability density function (pdf)
#of given observations
##param data: observations for which the pdf should be estimated
##return: returns an array with the estimated densities
function kernelDensity(data)

    #Uniform kernel function.
    ##param x: Current x value
    ##param observation: x value of observation i
    ##param width: binwidth
    ##return: Returns 1 if the absolute distance from
    #x(current) to x(observation) weighted by the binwidth
    #is at most 1. Else it returns 0.

    function uniformKernel(x, observation, width)
        u = (x - observation) / width
        abs(u) <= 1 ? 1 : 0
    end

    #number of observations in the data set
    n = length(data)

    #binwidth (set arbitrarily to 0.1)
    h = 0.1

    #vector that stores the pdf
    res = zeros(Real, n)

    #counter variable for the loop
    counter = 0

    #lower and upper limit of the x axis
    start = floor(minimum(data))
    stop = ceil(maximum(data))

    #main loop
    ##linspace: divides the space from start to stop into n
    #equally spaced points
    for x in linspace(start, stop, n)
        counter += 1
        for observation in data

            #count all observations for which the kernel
            #returns 1 and multiply by 0.5 because the
            #kernel computed the absolute difference, which can be
            #either positive or negative
            res[counter] += 0.5 * uniformKernel(x, observation, h)
        end
        #divide by n times h
        res[counter] /= n * h
    end
    #return results
    res
end
#run function
##rand: generates 10 uniform random numbers between 0 and 1
kernelDensity(rand(10))
and this is being returned:
> 0.0
> 1.5
> 2.5
> 1.0
> 1.5
> 1.0
> 0.0
> 0.5
> 0.5
> 0.0
the sum of which is 8.5 (the cumulative distribution function; it should be 1).
So there are two bugs:
The values are not properly scaled. Each number should be around one tenth of its current value. In fact, if the number of observations increases by a factor of 10^n (n = 1, 2, ...), then the cdf also increases by a factor of 10^n.
For example:
> kernelDensity(rand(1000))
> 953.53
They don't sum up to 10 (or one if it were not for the scaling error). The error becomes more evident as the sample size increases: there are approx. 5% of the observations not being included.
I believe that I implemented the formula 1:1, hence I really don't understand where the error is.
I'm not an expert on KDEs, so take all of this with a grain of salt, but a very similar (but much faster!) implementation of your code would be:
function kernelDensity{T<:AbstractFloat}(data::Vector{T}, h::T)
    n = length(data)     # number of observations
    res = zeros(T, n)    # accumulator has to start at zero
    lb = minimum(data); ub = maximum(data)
    for (i, x) in enumerate(linspace(lb, ub, n))
        for obs in data
            res[i] += abs((obs - x) / h) <= 1. ? 0.5 : 0.
        end
        res[i] /= (n * h)
    end
    sum(res)
end
If I'm not mistaken, the density estimate should integrate to 1, that is, we would expect kernelDensity(rand(100), 0.1)/100 to get at least close to 1. In the implementation above I'm getting there, give or take 5%, but then again we don't know that 0.1 is the optimal bandwidth (using h=0.135 instead I'm getting there to within 0.1%), and the uniform kernel is known to only be about 93% "efficient".
In any case, there's a very good Kernel Density package in Julia available here, so you probably should just do Pkg.add("KernelDensity") instead of trying to code your own Epanechnikov kernel :)
To point out the mistake: you have n bins B_i of size 2h covering [0,1], so a random point X lands in an expected 2nh bins, and each hit contributes 1/(2nh) because you divide by 2nh.
For n points, the expected value of the sum of your function's output is therefore about n rather than 1.
Actually, you have some bins of size < 2h (for example, if start = 0, half of the first bin is outside of [0,1]); factoring this in gives the bias.
Edit: Btw, the bias is easy to calculate if you assume that the bins have random locations in [0,1]. Then the bins are on average missing h/2 = 5% of their size.
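To make both effects concrete, here is a small Python re-creation of the same uniform-kernel estimate (a sketch, not a translation of either Julia snippet; `uniform_kde` and `linspace` are helper names made up here, and the seed and sample size are arbitrary). The raw sum of the grid values grows with n, while the sum multiplied by the grid spacing approximates the integral of the density and lands a little below 1 because of the truncated bins at the edges.

```python
import math
import random

def linspace(start, stop, num):
    """num equally spaced points from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

def uniform_kde(data, h, grid):
    """Uniform-kernel density estimate evaluated at every grid point."""
    n = len(data)
    return [
        sum(0.5 for obs in data if abs((x - obs) / h) <= 1.0) / (n * h)
        for x in grid
    ]

rng = random.Random(0)                      # fixed seed, arbitrary choice
data = [rng.random() for _ in range(500)]
grid = linspace(math.floor(min(data)), math.ceil(max(data)), len(data))
density = uniform_kde(data, 0.1, grid)

dx = grid[1] - grid[0]
raw_sum = sum(density)        # grows roughly like n
integral = raw_sum * dx       # Riemann sum: close to 1 - h/2
```

With h = 0.1 the integral should come out near 0.95, which matches the roughly 5% shortfall described in the question.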

Scoring System Suggestion - weighted mechanism?

I'm trying to validate a series of words that are provided by users. I'm trying to come up with a scoring system that will determine the likelihood that the series of words are indeed valid words.
Assume the following input:
xxx yyy zzz
The first thing I do is check each word individually against a database of words that I have. So, let's say that xxx was in the database, so we are 100% sure it's a valid word. Then let's say that yyy doesn't exist in the database, but a possible variation of its spelling exist (say yyyy). We don't give yyy a score of 100%, but maybe something lower (let's say 90%). Then zzz just doesn't exist at all in the database. So, zzz gets a score of 0%.
So we have something like this:
xxx = 100%
yyy = 90%
zzz = 0%
Assume further that the users are either going to either:
Provide a list of all valid words (most likely)
Provide a list of all invalid words (likely)
Provide a list of a mix of valid and invalid words (not likely)
As a whole, what is a good scoring system to determine a confidence score that xxx yyy zzz is a series of valid words? I'm not looking for anything too complex, but getting the average of the scores doesn't seem right. If some words in the list of words are valid, I think it increases the likelihood that the word not found in the database is an actual word also (it's just a limitation of the database that it doesn't contain that particular word).
NOTE: The input will generally be a minimum of 2 words (and mostly 2 words), but can be 3, 4, 5 (and maybe even more in some rare cases).
EDIT I have added a new section looking at discriminating word groups into English and non-English groups. This is below the section on estimating whether any given word is English.
I think you intuit that the scoring system you've explained here doesn't quite do justice to this problem.
It's great to find words that are in the dictionary - those words can immediately be given 100% and passed over, but what about non-matching words? How can you determine their probability?
This can be explained by a simple comparison between sentences comprising exactly the same letters:
Abergrandly recieved wuzkinds
Erbdnerye wcgluszaaindid vker
Neither sentence has any English words, but the first sentence looks English - it might be about someone (Abergrandly) who received (there was a spelling mistake) several items (wuzkinds). The second sentence is clearly just my infant hitting the keyboard.
So, in the example above, even though there is no English word present, the probability it's spoken by an English speaker is high. The second sentence has a 0% probability of being English.
I know a couple of heuristics to help detect the difference:
Simple frequency analysis of letters
In any language, some letters are more common than others. Simply counting the incidence of each letter and comparing it to the language's average tells us a lot.
There are several ways you could calculate a probability from it. One might be:
Preparation
Compute or obtain the frequencies of letters in a suitable English corpus. The NLTK is an excellent way to begin. The associated Natural Language Processing with Python book is very informative.
The Test
Count the number of occurrences of each letter in the phrase to test
Compute the Linear regression where the co-ordinate of each letter-point is:
X axis: Its predicted frequency from the Preparation step above
Y axis: The actual count
Perform a Regression Analysis on the data
English should report a positive r close to 1.0. Compute the R^2 as a probability that this is English.
An r of 0 or below is either no correlation to English, or the letters have a negative correlation. Not likely English.
Advantages:
Very simple to calculate
Disadvantages:
Will not work so well for small samples, eg "zebra, xylophone"
"Rrressseee" would seem a highly probable word
Does not discriminate between the two example sentences I gave above.
Bigram frequencies and Trigram frequencies
This is an extension of letter frequencies, but looks at the frequency of letter pairs or triplets. For example, a u follows a q with 99% frequency (why not 100%? dafuq). Again, the NLTK corpus is incredibly useful.
Above from: http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/digraphs.jpg
This approach is widely used across the industry, in everything from speech recognition to predictive text on your soft keyboard.
Trigraphs are especially useful. Consider that 'll' is a very common digraph. The string 'lllllllll' therefore consists of only common digraphs and the digraph approach makes it look like a word. Trigraphs resolve this because 'lll' never occurs.
The calculation of this probability of a word using trigraphs can't be done with a simple linear regression model (the vast majority of trigrams will not be present in the word and so the majority of points will be on the x axis). Instead you can use Markov chains (using a probability matrix of either bigrams or trigrams) to compute the probability of a word. An introduction to Markov chains is here.
First build a matrix of probabilities:
X axis: Every bigram ("th", "he", "in", "er", "an", etc)
Y axis: The letters of the alphabet.
The matrix members consist of the probability of the letter of the alphabet following the bigraph.
To start computing probabilities from the start of the word, the X axis digraphs need to include spaces-a, space-b up to space-z - eg the digraph "space" t represents a word starting t.
Computing the probability of the word consists of iterating over digraphs and obtaining the probability of the third letter given the digraph. For example, the word "they" is broken down into the following probabilities:
h following "space" t -> probability x%
e following th -> probability y%
y following he -> probability z%
Overall probability = x * y * z %
This computation solves the issues for a simple frequency analysis by highlighting the "wcgl" as having a 0% probability.
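The "they" breakdown can be sketched in Python as below. This is a toy illustration only: the eight-word `CORPUS` is a made-up stand-in for a real corpus (for example one obtained through NLTK), the function names are invented for this sketch, and it works in log10 so that long words do not underflow. As in the breakdown above, a single leading space marks the start of a word.

```python
import math

# Hypothetical toy corpus; a real model would be trained on a large word list.
CORPUS = ["the", "they", "then", "them", "there", "this", "that", "hey"]

def train(words):
    """Count, for every two-character context, which letter follows it."""
    model = {}
    for word in words:
        padded = " " + word.lower()
        for i in range(len(padded) - 2):
            context, nxt = padded[i:i + 2], padded[i + 2]
            followers = model.setdefault(context, {})
            followers[nxt] = followers.get(nxt, 0) + 1
    return model

def word_log10_prob(model, word):
    """Sum of log10 P(letter | previous two characters); -inf for an unseen step."""
    padded = " " + word.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        context, nxt = padded[i:i + 2], padded[i + 2]
        followers = model.get(context)
        if not followers or nxt not in followers:
            return float("-inf")
        total += math.log10(followers[nxt] / sum(followers.values()))
    return total

model = train(CORPUS)
they = word_log10_prob(model, "they")        # log10(1 * 5/7 * 2/5)
gibberish = word_log10_prob(model, "wcgl")   # unseen context, probability 0
```

In this toy model "they" gets probability 1 * 5/7 * 2/5 = 2/7, while "wcgl" gets probability 0. A real implementation would probably smooth unseen trigrams instead of assigning exactly zero; that choice is left open here.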
Note that the probability of any given word will be very small and becomes statistically smaller by between 10x to 20x per extra character. However, examining the probability of known English words of 3, 4, 5, 6, etc characters from a large corpus, you can determine a cutoff below which the word is highly unlikely. Each highly unlikely trigraph will drop the likelihood of being English by 1 to 2 orders of magnitude.
You might then normalize the probability of a word, for example, for 8-letter English words (I've made up the numbers below):
Probabilities from Markov chain:
Probability of the best English word = 10^-7 (10% * 10% * .. * 10%)
Cutoff (Probability of least likely English word) = 10^-14 (1% * 1% * .. * 1%)
Probability for test word (say "coattail") = 10^-12
'Normalize' results
Take logs: Best = -7; Test = -12; Cutoff = -14
Make positive: Best = 7; Test = 2; Cutoff = 0
Normalize between 1.0 and 0.0: Best = 1.0; Test = 0.29; Cutoff = 0.0
(You can easily adjust the higher and lower bounds to, say, between 90% and 10%)
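The normalization steps can be written as one small Python function. This is a sketch under the same made-up numbers as above; the name `normalized_score` and its `low`/`high` parameters are illustrative, and clamping scores outside the best/cutoff range is an added assumption.

```python
import math

def normalized_score(p_word, p_best, p_cutoff, low=0.0, high=1.0):
    """Map a raw word probability onto [low, high] on a log10 scale."""
    if p_word <= 0.0:
        return low  # a zero probability sits below any cutoff
    log_best = math.log10(p_best)
    log_cutoff = math.log10(p_cutoff)
    fraction = (math.log10(p_word) - log_cutoff) / (log_best - log_cutoff)
    fraction = min(1.0, max(0.0, fraction))  # clamp (assumption)
    return low + fraction * (high - low)

# 8-letter example: best 10^-7, cutoff 10^-14, test word 10^-12.
score = normalized_score(1e-12, 1e-7, 1e-14)                 # 2/7
rescaled = normalized_score(1e-12, 1e-7, 1e-14, 0.1, 0.9)    # 10%..90% range
```

Here `score` is 2/7, about 0.29, and `rescaled` shows the same word mapped into the 10% to 90% range.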
Now we've examined how to get a better probability that any given word is English, let's look at the group of words.
The group's definition is that it's a minimum of 2 words, but can be 3, 4, 5 or (in a small number of cases) more. You don't mention that there is any overriding structure or associations between the words, so I am not assuming:
That any group is a phrase, eg "tank commander", "red letter day"
That the group is a sentence or clause, eg " I am thirsty", "Mary needs an email"
However if this assumption is wrong, then the problem becomes more tractable for larger word-groups because the words will conform to English's rules of syntax - we can use, say, the NLTK to parse the clause to gain more insight.
Looking at the probability that a group of words is English
OK, in order to get a feel of the problem, let's look at different use cases. In the following:
I am going to ignore the cases of all words or all not-words as those cases are trivial
I will consider English-like words that can't be assumed to be in a dictionary, like weird surnames (eg Kardashian), unusual product names (eg stackexchange) and so on.
I will use simple averages of the probabilities assuming that random gibberish is 0% while English-like words are at 90%.
Two words
1. (50%) Red ajkhsdjas
2. (50%) Hkdfs Friday
3. (95%) Kardashians program
4. (95%) Using Stackexchange
From these examples, I think you would agree that 1. and 2. are likely not acceptable whereas 3. and 4. are. The simple average calculation appears a useful discriminator for two word groups.
Three words
With one suspect word:
1. (67%) Red dawn dskfa
2. (67%) Hskdkc communist manifesto
3. (67%) Economic jasdfh crisis
4. (97%) Kardashian fifteen minutes
5. (97%) stackexchange user experience
Clearly 4. and 5. are acceptable.
But what about 1., 2. or 3.? Are there any material differences between 1., 2. or 3.? Probably not, ruling out using Bayesian statistics. But should these be classified as English or not? I think that's your call.
With two suspect words:
1. (33%) Red ksadjak adsfhd
2. (33%) jkdsfk dsajjds manifesto
3. (93%) Stackexchange emails Kardashians
4. (93%) Stackexchange Kardashian account
I would hazard that 1. and 2. are not acceptable, but 3. and 4. definitely are. (Well, except the Kardashians' having an account here - that does not bode well). Again the simple averages can be used as a simple discriminator - and you can choose if it's above or below 67%.
Four words
The number of permutations starts getting wild, so I'll give only a few examples:
1. One suspect word:
1.1 (75%) Programming jhjasd language today
1.2 (98%) Hopeless Kardashian tv series
2. Two suspect words:
2.1 (50%) Programming kasdhjk jhsaer today
2.2 (95%) Stackexchange implementing Kardashian filter
3. Three suspect words:
3.1 (25%) Programming sdajf jkkdsf kuuerc
3.2 (93%) Stackexchange bitifying Kardashians tweetdeck
In my mind, which word groups are meaningful clearly aligns with the simple average, with the exception of 2.1 - that's again your call.
Interestingly, the cutoff point for four-word groups might be different from that for three-word groups, so I'd recommend that your implementation has a different configuration setting for each group size. Having different cutoffs is a consequence of the fact that the quantum jump from 2->3 and then 3->4 words does not mesh with the idea of smooth, continuous probabilities.
Implementing different cutoff values for these groups directly addresses your intuition "Right now, I just have a "gut" feeling that my xxx yyy zzz example really should be higher than 66.66%, but I'm not sure how to express it as a formula.".
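A minimal Python sketch of the simple average with a separate cutoff per group size. The cutoff values in `CUTOFFS` and the `default_cutoff` are hypothetical placeholders that you would tune on your own data, and the function names are made up for this sketch; the per-word scores follow the convention used above (1.0 for dictionary words, 0.9 for English-like words, 0.0 for gibberish).

```python
# Hypothetical cutoffs per number of words; tune these on real data.
CUTOFFS = {2: 0.90, 3: 0.60, 4: 0.70}

def group_score(word_scores):
    """Simple average of the per-word scores (each between 0.0 and 1.0)."""
    if not word_scores:
        raise ValueError("need at least one word score")
    return sum(word_scores) / len(word_scores)

def is_acceptable(word_scores, cutoffs=CUTOFFS, default_cutoff=0.70):
    """Compare the average against the cutoff configured for this group size."""
    cutoff = cutoffs.get(len(word_scores), default_cutoff)
    return group_score(word_scores) >= cutoff

red_gibberish = group_score([1.0, 0.0])              # "Red ajkhsdjas"
kardashians_program = group_score([0.9, 1.0])        # "Kardashians program"
xxx_yyy_zzz = group_score([1.0, 0.9, 0.0])           # the question's example
```

With these placeholder cutoffs, "Kardashians program" passes and "Red ajkhsdjas" fails, while the xxx yyy zzz example (average about 0.63) passes only because the three-word cutoff was set lower than the two-word one.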
Five words
You get the idea - I'm not going to enumerate any more here. However, as you get to five words, it starts to get enough structure that several new heuristics can come in:
Use of Bayesian probabilities/statistics (what is the probability of the third word being a word given that the first two were?)
Parsing the group using the NLTK and looking at whether it makes grammatical sense
Problem cases
English has a number of very short words, and this might cause a problem. For example:
Gibberish: r xu r
Is this English? I am a
You may have to write code to specifically test for 1 and 2 letter words.
TL;DR Summary
Non-dictionary words can be tested for how 'English' (or French or Spanish, etc) they are using letter and trigram frequencies. Picking up English-like words and attributing them a high score is critical to distinguish English groups
Up to four words, a simple average has great discriminatory power, but you probably want to set a different cutoff for 2 words, 3 words and 4 words.
Five words and above you can probably start using Bayesian statistics
Longer word groups if they should be sentences or sentence fragments can be tested using a natural language tool, such as NLTK.
This is a heuristic process and, ultimately, there will be confounding values (such as "I am a"). Writing a perfect statistical analysis routine may therefore not be especially useful compared to a simple average if it can be confounded by a large number of exceptions.
Perhaps you could use Bayes' formula.
You already have numerical guesses for the probability of each word to be real.
Next step is to make educated guesses about the probability of the entire list being good, bad or mixed (i.e., turn "most likely", "likely" and "not likely" into numbers.)
I'll give a Bayesian hierarchical model solution. It has a few parameters that must be set by hand, but it is quite robust regarding these parameters, as the simulation below shows. And it can handle not only the scoring system for the word list, but also a probable classification of the user who entered the words. The treatment may be a little technical, but in the end we'll have a routine to calculate the scores as a function of 3 numbers: the number of words in the list, the number of those with an exact match in the database, and the number of those with a partial match (as in yyyy). The routine is implemented in R, but if you have never used it, just download the interpreter, copy and paste the code into its console, and you'll see the results shown here.
BTW english is not my first language, so bear with me... :-)
1. Model Specification:
There are 3 classes of users, named I, II, III. We assume that each word list is generated by a single user, and that the user is drawn randomly from a universe of users. We say that this universe is 70% class I, 25% class II and 5% class III. These numbers can be changed, of course. We have so far
Prob[User=I] = 70%
Prob[User=II] = 25%
Prob[User=III] = 5%
Given the user, we assume conditional independence, i.e., the user will not look to previous words to decide if he'll type in a valid or invalid word.
User I tends to give only valid words, User II only invalid words, and user III is mixed. So we set
Prob[Word=OK | User=I] = 99%
Prob[Word=OK | User=II] = 0.1%
Prob[Word=OK | User=III] = 50%
The probabilities of the word being invalid, given the class of the user, are complementary. Note that we give a very small, but non-zero probability of a class-II user entering valid words, since even a monkey in front of a typewriter will eventually type a valid word.
The final step of the model specification regards the database. We assume that, for each word, the query may have 3 outcomes: a total match, a partial match (as in yyyy ) or no match. In probability terms, we assume that
Prob[match | valid] = 98% (not all valid words will be found)
Prob[partial | valid] = 0.2% (a rare event)
Prob[match | INvalid] = 0 (the database may be incomplete, but it has no invalid words)
Prob[partial | INvalid] = 0.1% (a rare event)
The probabilities of not finding the word don't have to be set, as they are complementary. That's it, our model is set.
2. Notation and Objective
We have a discrete random variable U, taking values in {1, 2, 3} and two discrete random vectors W and F, each of size n (= the number of words), where W_i is 1 if the word is valid and 2 if the word is invalid, and F_i is 1 if the word is found in the database, 2 if it's a partial match and 3 if it's not found.
Only vector F is observable, the others are latent. Using Bayes theorem and the distributions we set up in the model specification, we can calculate
(a) Prob[User=I | F],
i. e., the posterior probability of the user being in class I, given the observed matchings; and
(b) Prob[W=all valid | F],
i. e., the posterior probability that all words are valid, given the observed matchings.
Depending on your objective, you can use one or another as a scoring solution. If you are interested in distinguishing a real user from a computer program, for instance, you can use (a). If you only care about the word list being valid, you should use (b).
I'll try to explain the theory briefly in the next section, but this is the usual setup in the context of Bayesian hierarchical models. The reference is Gelman (2004), "Bayesian Data Analysis".
If you want, you can jump to section 4, with the code.
3. The Math
I'll use a slight abuse of notation, as usual in this context, writing
p(x|y) for Prob[X=x|Y=y] and p(x,y) for Prob[X=x,Y=y].
The goal (a) is to calculate p(u|f), for u=1. Using Bayes theorem:
p(u|f) = p(u,f)/p(f) = p(f|u)p(u)/p(f).
p(u) is given. p(f|u) is obtained from:
p(f|u) = \prod_{i=1}^{n} \sum_{w_i=1}^{2} p(f_i|w_i) p(w_i|u)
= \prod_{i=1}^{n} p(f_i|u)
= p(f_i=1|u)^(m) p(f_i=2|u)^(p) p(f_i=3|u)^(n-m-p)
where m = number of matchings and p = number of partial matchings.
p(f) is calculated as:
\sum_{u=1}^{3} p(f|u)p(u)
All these can be calculated directly.
Goal (b) is given by
p(w|f) = p(f|w)*p(w)/p(f)
where
p(f|w) = \prod_{i=1}^{n} p(f_i|w_i)
and p(f_i|w_i) is given in the model specification.
p(f) was calculated above, so we need only
p(w) = \sum_{u=1}^{3} p(w|u)p(u)
where
p(w|u) = \prod_{i=1}^{n} p(w_i|u)
So everything is set for implementation.
4. The Code
The code is written as an R script. The constants are set at the beginning, in accordance with what was discussed above, and the output is given by the functions
(a) p.u_f(u, n, m, p)
and
(b) p.wOK_f(n, m, p)
that calculate the probabilities for options (a) and (b), given inputs:
u = desired user class (set to u=1)
n = number of words
m = number of matchings
p = number of partial matchings
The code itself:
### Constants:
# User:
# Prob[U=1], Prob[U=2], Prob[U=3]
Prob_user = c(0.70, 0.25, 0.05)
# Words:
# Prob[Wi=OK|U=1,2,3]
Prob_OK = c(0.99, 0.001, 0.5)
Prob_NotOK = 1 - Prob_OK
# Database:
# Prob[Fi=match|Wi=OK], Prob[Fi=match|Wi=NotOK]:
Prob_match = c(0.98, 0)
# Prob[Fi=partial|Wi=OK], Prob[Fi=partial|Wi=NotOK]:
Prob_partial = c(0.002, 0.001)
# Prob[Fi=NOmatch|Wi=OK], Prob[Fi=NOmatch|Wi=NotOK]:
Prob_NOmatch = 1 - Prob_match - Prob_partial
###### First Goal: Probability of being a user type I, given the numbers of matchings (m) and partial matchings (p).
# Prob[Fi=fi|U=u]
#
p.fi_u <- function(fi, u)
{
unname(rbind(Prob_match, Prob_partial, Prob_NOmatch) %*% rbind(Prob_OK, Prob_NotOK))[fi,u]
}
# Prob[F=f|U=u]
#
p.f_u <- function(n, m, p, u)
{
exp( log(p.fi_u(1, u))*m + log(p.fi_u(2, u))*p + log(p.fi_u(3, u))*(n-m-p) )
}
# Prob[F=f]
#
p.f <- function(n, m, p)
{
p.f_u(n, m, p, 1)*Prob_user[1] + p.f_u(n, m, p, 2)*Prob_user[2] + p.f_u(n, m, p, 3)*Prob_user[3]
}
# Prob[U=u|F=f]
#
p.u_f <- function(u, n, m, p)
{
p.f_u(n, m, p, u) * Prob_user[u] / p.f(n, m, p)
}
# Probability user type I for n=1,...,5:
for(n in 1:5) for(m in 0:n) for(p in 0:(n-m))
{
cat("n =", n, "| m =", m, "| p =", p, "| Prob type I =", p.u_f(1, n, m, p), "\n")
}
##################################################################################################
# Second Goal: Probability all words OK given matchings/partial matchings.
p.f_wOK <- function(n, m, p)
{
  exp( log(Prob_match[1])*m + log(Prob_partial[1])*p + log(Prob_NOmatch[1])*(n-m-p) )
}
p.wOK <- function(n)
{
  sum(exp( log(Prob_OK)*n + log(Prob_user) ))
}
p.wOK_f <- function(n, m, p)
{
  p.f_wOK(n, m, p)*p.wOK(n)/p.f(n, m, p)
}
# Probability all words ok for n=1,...,5:
for(n in 1:5) for(m in 0:n) for(p in 0:(n-m))
{
  cat("n =", n, "| m =", m, "| p =", p, "| Prob all OK =", p.wOK_f(n, m, p), "\n")
}
5. Results
These are the results for n=1,...,5, and all possibilities for m and p. For instance, if you have 3 words, one match, one partial match, and one not found, you can be 66.5% sure it's a class-I user. In the same situation, you can attribute a score of 42.8% that all words are valid.
Note that option (a) does not give 100% score to the case of all matches, but option (b) does. This is expected, since we assumed that the database has no invalid words, hence if they are all found, then they are all valid. OTOH, there is a small chance that a user in class II or III can enter all valid words, but this chance decreases rapidly as n increases.
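To spot-check individual rows of the tables without running R, here is a minimal Python sketch of the same calculation. It is a port I wrote for illustration, not the original code: the function names are made up, the constants are copied from the R listing, and the user class u is 0-based here (u=0 is class I).

```python
# Constants copied from the R listing; names are illustrative.
PROB_USER = [0.70, 0.25, 0.05]      # P[U=1], P[U=2], P[U=3]
PROB_OK = [0.99, 0.001, 0.5]        # P[Wi=OK | U=u]
PROB_MATCH = [0.98, 0.0]            # P[Fi=match | Wi=OK], P[Fi=match | Wi=NotOK]
PROB_PARTIAL = [0.002, 0.001]       # P[Fi=partial | Wi=OK], P[Fi=partial | Wi=NotOK]
PROB_NOMATCH = [1 - a - b for a, b in zip(PROB_MATCH, PROB_PARTIAL)]


def p_fi_u(outcome, u):
    """P[Fi=outcome | U=u], summing over whether the word is OK or not."""
    ok = PROB_OK[u]
    return outcome[0] * ok + outcome[1] * (1 - ok)


def p_f_u(n, m, p, u):
    """P[F=f | U=u] for m matches, p partial matches, n-m-p misses."""
    return (p_fi_u(PROB_MATCH, u) ** m
            * p_fi_u(PROB_PARTIAL, u) ** p
            * p_fi_u(PROB_NOMATCH, u) ** (n - m - p))


def p_f(n, m, p):
    """P[F=f], mixing over the three user classes."""
    return sum(p_f_u(n, m, p, u) * PROB_USER[u] for u in range(3))


def prob_type_one(n, m, p):
    """Option (a): P[U=class I | F=f] by Bayes' rule."""
    return p_f_u(n, m, p, 0) * PROB_USER[0] / p_f(n, m, p)


def prob_all_ok(n, m, p):
    """Option (b): P[all n words OK | F=f]."""
    likelihood = (PROB_MATCH[0] ** m * PROB_PARTIAL[0] ** p
                  * PROB_NOMATCH[0] ** (n - m - p))
    prior = sum(ok ** n * pu for ok, pu in zip(PROB_OK, PROB_USER))
    return likelihood * prior / p_f(n, m, p)


print(prob_type_one(3, 1, 1), prob_all_ok(3, 1, 1))
```

For n=3, m=1, p=1 this should agree with the 66.5% and 42.8% quoted above.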
(a)
n = 1 | m = 0 | p = 0 | Prob type I = 0.06612505
n = 1 | m = 0 | p = 1 | Prob type I = 0.8107086
n = 1 | m = 1 | p = 0 | Prob type I = 0.9648451
n = 2 | m = 0 | p = 0 | Prob type I = 0.002062543
n = 2 | m = 0 | p = 1 | Prob type I = 0.1186027
n = 2 | m = 0 | p = 2 | Prob type I = 0.884213
n = 2 | m = 1 | p = 0 | Prob type I = 0.597882
n = 2 | m = 1 | p = 1 | Prob type I = 0.9733557
n = 2 | m = 2 | p = 0 | Prob type I = 0.982106
n = 3 | m = 0 | p = 0 | Prob type I = 5.901733e-05
n = 3 | m = 0 | p = 1 | Prob type I = 0.003994149
n = 3 | m = 0 | p = 2 | Prob type I = 0.200601
n = 3 | m = 0 | p = 3 | Prob type I = 0.9293284
n = 3 | m = 1 | p = 0 | Prob type I = 0.07393334
n = 3 | m = 1 | p = 1 | Prob type I = 0.665019
n = 3 | m = 1 | p = 2 | Prob type I = 0.9798274
n = 3 | m = 2 | p = 0 | Prob type I = 0.7500993
n = 3 | m = 2 | p = 1 | Prob type I = 0.9864524
n = 3 | m = 3 | p = 0 | Prob type I = 0.990882
n = 4 | m = 0 | p = 0 | Prob type I = 1.66568e-06
n = 4 | m = 0 | p = 1 | Prob type I = 0.0001158324
n = 4 | m = 0 | p = 2 | Prob type I = 0.007636577
n = 4 | m = 0 | p = 3 | Prob type I = 0.3134207
n = 4 | m = 0 | p = 4 | Prob type I = 0.9560934
n = 4 | m = 1 | p = 0 | Prob type I = 0.004198015
n = 4 | m = 1 | p = 1 | Prob type I = 0.09685249
n = 4 | m = 1 | p = 2 | Prob type I = 0.7256616
n = 4 | m = 1 | p = 3 | Prob type I = 0.9847408
n = 4 | m = 2 | p = 0 | Prob type I = 0.1410053
n = 4 | m = 2 | p = 1 | Prob type I = 0.7992839
n = 4 | m = 2 | p = 2 | Prob type I = 0.9897541
n = 4 | m = 3 | p = 0 | Prob type I = 0.855978
n = 4 | m = 3 | p = 1 | Prob type I = 0.9931117
n = 4 | m = 4 | p = 0 | Prob type I = 0.9953741
n = 5 | m = 0 | p = 0 | Prob type I = 4.671933e-08
n = 5 | m = 0 | p = 1 | Prob type I = 3.289577e-06
n = 5 | m = 0 | p = 2 | Prob type I = 0.0002259559
n = 5 | m = 0 | p = 3 | Prob type I = 0.01433312
n = 5 | m = 0 | p = 4 | Prob type I = 0.4459982
n = 5 | m = 0 | p = 5 | Prob type I = 0.9719289
n = 5 | m = 1 | p = 0 | Prob type I = 0.0002158996
n = 5 | m = 1 | p = 1 | Prob type I = 0.005694145
n = 5 | m = 1 | p = 2 | Prob type I = 0.1254661
n = 5 | m = 1 | p = 3 | Prob type I = 0.7787294
n = 5 | m = 1 | p = 4 | Prob type I = 0.988466
n = 5 | m = 2 | p = 0 | Prob type I = 0.00889696
n = 5 | m = 2 | p = 1 | Prob type I = 0.1788336
n = 5 | m = 2 | p = 2 | Prob type I = 0.8408416
n = 5 | m = 2 | p = 3 | Prob type I = 0.9922575
n = 5 | m = 3 | p = 0 | Prob type I = 0.2453087
n = 5 | m = 3 | p = 1 | Prob type I = 0.8874493
n = 5 | m = 3 | p = 2 | Prob type I = 0.994799
n = 5 | m = 4 | p = 0 | Prob type I = 0.9216786
n = 5 | m = 4 | p = 1 | Prob type I = 0.9965092
n = 5 | m = 5 | p = 0 | Prob type I = 0.9976583
(b)
n = 1 | m = 0 | p = 0 | Prob all OK = 0.04391523
n = 1 | m = 0 | p = 1 | Prob all OK = 0.836025
n = 1 | m = 1 | p = 0 | Prob all OK = 1
n = 2 | m = 0 | p = 0 | Prob all OK = 0.0008622994
n = 2 | m = 0 | p = 1 | Prob all OK = 0.07699368
n = 2 | m = 0 | p = 2 | Prob all OK = 0.8912977
n = 2 | m = 1 | p = 0 | Prob all OK = 0.3900892
n = 2 | m = 1 | p = 1 | Prob all OK = 0.9861099
n = 2 | m = 2 | p = 0 | Prob all OK = 1
n = 3 | m = 0 | p = 0 | Prob all OK = 1.567032e-05
n = 3 | m = 0 | p = 1 | Prob all OK = 0.001646751
n = 3 | m = 0 | p = 2 | Prob all OK = 0.1284228
n = 3 | m = 0 | p = 3 | Prob all OK = 0.923812
n = 3 | m = 1 | p = 0 | Prob all OK = 0.03063598
n = 3 | m = 1 | p = 1 | Prob all OK = 0.4278888
n = 3 | m = 1 | p = 2 | Prob all OK = 0.9789305
n = 3 | m = 2 | p = 0 | Prob all OK = 0.485069
n = 3 | m = 2 | p = 1 | Prob all OK = 0.990527
n = 3 | m = 3 | p = 0 | Prob all OK = 1
n = 4 | m = 0 | p = 0 | Prob all OK = 2.821188e-07
n = 4 | m = 0 | p = 1 | Prob all OK = 3.046322e-05
n = 4 | m = 0 | p = 2 | Prob all OK = 0.003118531
n = 4 | m = 0 | p = 3 | Prob all OK = 0.1987396
n = 4 | m = 0 | p = 4 | Prob all OK = 0.9413746
n = 4 | m = 1 | p = 0 | Prob all OK = 0.001109629
n = 4 | m = 1 | p = 1 | Prob all OK = 0.03975118
n = 4 | m = 1 | p = 2 | Prob all OK = 0.4624648
n = 4 | m = 1 | p = 3 | Prob all OK = 0.9744778
n = 4 | m = 2 | p = 0 | Prob all OK = 0.05816511
n = 4 | m = 2 | p = 1 | Prob all OK = 0.5119571
n = 4 | m = 2 | p = 2 | Prob all OK = 0.9843855
n = 4 | m = 3 | p = 0 | Prob all OK = 0.5510398
n = 4 | m = 3 | p = 1 | Prob all OK = 0.9927134
n = 4 | m = 4 | p = 0 | Prob all OK = 1
n = 5 | m = 0 | p = 0 | Prob all OK = 5.05881e-09
n = 5 | m = 0 | p = 1 | Prob all OK = 5.530918e-07
n = 5 | m = 0 | p = 2 | Prob all OK = 5.899106e-05
n = 5 | m = 0 | p = 3 | Prob all OK = 0.005810434
n = 5 | m = 0 | p = 4 | Prob all OK = 0.2807414
n = 5 | m = 0 | p = 5 | Prob all OK = 0.9499773
n = 5 | m = 1 | p = 0 | Prob all OK = 3.648353e-05
n = 5 | m = 1 | p = 1 | Prob all OK = 0.001494098
n = 5 | m = 1 | p = 2 | Prob all OK = 0.051119
n = 5 | m = 1 | p = 3 | Prob all OK = 0.4926606
n = 5 | m = 1 | p = 4 | Prob all OK = 0.9710204
n = 5 | m = 2 | p = 0 | Prob all OK = 0.002346281
n = 5 | m = 2 | p = 1 | Prob all OK = 0.07323064
n = 5 | m = 2 | p = 2 | Prob all OK = 0.5346423
n = 5 | m = 2 | p = 3 | Prob all OK = 0.9796679
n = 5 | m = 3 | p = 0 | Prob all OK = 0.1009589
n = 5 | m = 3 | p = 1 | Prob all OK = 0.5671273
n = 5 | m = 3 | p = 2 | Prob all OK = 0.9871377
n = 5 | m = 4 | p = 0 | Prob all OK = 0.5919764
n = 5 | m = 4 | p = 1 | Prob all OK = 0.9938288
n = 5 | m = 5 | p = 0 | Prob all OK = 1
If averaging is not a solution because the database lacks words, I'd say: extend the database :)
Another idea would be to weight the results to get a slightly adjusted average, for example:
100% = 1.00x weight
90% = 0.95x weight
80% = 0.90x weight
...
0% = 0.50x weight
So for your example you would get:
(100*1 + 90*0.95 + 0*0.5) / (100*1 + 100*0.95 + 100*0.5) = 0.75714285714
=> 75.7%
The regular average would be 63.3%.
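The weighting above can be sketched in a few lines of Python. This assumes the elided rows of the table follow the same straight line as the ones shown (weight 0.50 at 0% rising to 1.00 at 100%); the function names are just for illustration.

```python
def weight(score_percent):
    """Assumed linear weight: 0.50 at 0 %, 1.00 at 100 % (so 90 % -> 0.95)."""
    return 0.5 + 0.5 * score_percent / 100.0


def weighted_score(scores_percent):
    """Weighted sum of the scores over the weighted sum of perfect scores."""
    weights = [weight(s) for s in scores_percent]
    numerator = sum(s * w for s, w in zip(scores_percent, weights))
    denominator = sum(100.0 * w for w in weights)
    return numerator / denominator


scores = [100, 90, 0]
print(weighted_score(scores))               # weighted result as a fraction
print(sum(scores) / len(scores) / 100.0)    # plain average for comparison
```

Because low scores get a smaller weight in both the numerator and the denominator, they pull the result down less than they do in the plain average.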
Since the order of words is not important in your description, the independent variable is the fraction of valid words. If the fraction is a perfect 1, i.e. all words are found to be perfect matches with the DB, then you are perfectly sure to have the all-valid outcome. If it's zero, i.e. all words are perfect misses in the DB, then you are perfectly sure to have the all-invalid outcome. If you have .5, then this must be the unlikely mixed-up outcome because neither of the other two is possible.
You say the mixed outcome is unlikely while the two extremes are more so. You are after the likelihood of the all-valid outcome.
Let the fraction of valid words (sum of "surenesses" of matches / # of words) be f and hence the desired likelihood of the all-valid outcome be L(f). By the discussion so far, we know L(1)=1 and L(f)=0 for 0<=f<=1/2 .
To honor your information that the mixed outcome is less likely than the all-valid (and the all-invalid) outcome, the shape of L must rise monotonically and quickly from 1/2 toward 1 and reach 1 at f=1.
Since this is heuristic, we might pick any reasonable function with this character. If we're clever it will have a parameter to control the steepness of the step and perhaps another for its location. This lets us tweak what "less likely" means for the middle case.
One such function is this for 1/2 <= f <= 1:
L(f) = 5 + f * (-24 + (36 - 16 * f) * f) + (-4 + f * (16 + f * (-20 + 8 * f))) * s
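and zero for 0 <= f < 1/2. Although it's hairy-looking, it's the simplest polynomial (a cubic) that passes through (1/2,0) and (1,1) with slope 0 at f=1 and slope 2s at f=1/2. Here s is the slope measured in the rescaled variable t = 2f - 1, which runs from 0 to 1 over this interval.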
You can set 0 <= s <= 3 to change the step shape. The steepest setting, s=3, is probably what you want.
If you set s > 3, it shoots above 1 before settling down, not what we want.
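Here is a small Python sketch of that function so you can try different values of s. The function name and the argument check are my own additions; the polynomial is the one written above.

```python
def all_valid_likelihood(f, s=3.0):
    """Heuristic L(f): 0 on [0, 1/2), the cubic above on [1/2, 1]."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if f < 0.5:
        return 0.0
    return (5 + f * (-24 + (36 - 16 * f) * f)
            + (-4 + f * (16 + f * (-20 + 8 * f))) * s)


# Sample the rising half of the curve for the default s=3.
for f in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f, all_valid_likelihood(f))

# With s > 3 the curve overshoots 1 before coming back down.
grid = [0.5 + i * 0.005 for i in range(101)]
print(max(all_valid_likelihood(f, s=4.0) for f in grid))
```

Evaluating on a grid like this is also a quick way to confirm that a chosen s keeps the curve between 0 and 1.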
Of course there are infinitely many other possibilities. If this one doesn't work, comment and we'll look for another.
Averaging is, of course, rubbish. If the individual word probabilities were accurate, the probability that all words are correct is simply the product, not the average. If you have an estimate for the uncertainties in your individual probabilities, you could work out their product marginalized over all the individual probabilities.
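A tiny Python illustration of the difference, assuming the per-word probabilities are independent (which is what taking the plain product implies). The function name is made up for the example.

```python
import math


def prob_all_correct(word_probs):
    """Probability that every word is correct, assuming independent words."""
    return math.prod(word_probs)


probs = [0.9, 0.9, 0.9]
print(prob_all_correct(probs))      # product of the three probabilities
print(sum(probs) / len(probs))      # average, which stays at 0.9

# A single certain miss forces the product to zero, unlike the average.
print(prob_all_correct([1.0, 0.9, 0.0]))
```

The product shrinks as more uncertain words are added, while the average does not, which is why the average overstates confidence that all words are valid.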

What is the best algorithm to find a determinant of a matrix?

Can anyone tell me which is the best algorithm to find the value of determinant of a matrix of size N x N?
Here is an extensive discussion.
There are a lot of algorithms.
A simple one is to take the LU decomposition. Then, since
det M = det LU = det L * det U
and both L and U are triangular, the determinant is a product of the diagonal elements of L and U. That is O(n^3). There exist more efficient algorithms.
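As a rough sketch of the LU idea in plain Python: this is a Doolittle-style factorization without row exchanges, so it assumes every pivot is non-zero (a real implementation would pivot and track the sign of the permutation). The function names are illustrative.

```python
def lu_decompose(M):
    """Return (L, U) with M = L * U, L unit lower triangular, U upper triangular.

    Assumes no zero pivot turns up; no row exchanges are performed.
    """
    n = len(M)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[float(x) for x in row] for row in M]
    for k in range(n):
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot: a row exchange would be needed")
        for i in range(k + 1, n):
            factor = U[i][k] / U[k][k]
            L[i][k] = factor
            for j in range(k, n):
                U[i][j] -= factor * U[k][j]
    return L, U


def det_via_lu(M):
    """det M = det L * det U = product of the diagonal entries of L and U."""
    L, U = lu_decompose(M)
    det = 1.0
    for i in range(len(M)):
        det *= L[i][i] * U[i][i]
    return det


print(det_via_lu([[2, 1, 1], [4, -6, 0], [-2, 7, 2]]))
```

The three nested loops are where the O(n^3) cost comes from. Since L has ones on its diagonal in this variant, the determinant is really just the product of the diagonal of U.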
Row Reduction
The simplest way (and not a bad way, really) to find the determinant of an nxn matrix is by row reduction. By keeping in mind a few simple rules about determinants, we can solve in the form:
det(A) = α * det(R), where R is the row echelon form of the original matrix A, and α is some coefficient.
Finding the determinant of a matrix in row echelon form is really easy; you just find the product of the diagonal. Solving the determinant of the original matrix A then just boils down to calculating α as you find the row echelon form R.
What You Need to Know
What is row echelon form?
See this [link](http://stattrek.com/matrix-algebra/echelon-form.aspx) for a simple definition
**Note:** Not all definitions require 1s for the leading entries, and it is unnecessary for this algorithm.
You Can Find R Using Elementary Row Operations
Swapping rows, adding multiples of another row, etc.
You Derive α from Properties of Row Operations for Determinants
Rule 1: If B is a matrix obtained by multiplying a row of A by some non-zero constant β, then
det(B) = β * det(A)
In other words, you can essentially 'factor out' a constant from a row by just pulling it out front of the determinant.
Rule 2: If B is a matrix obtained by swapping two rows of A, then
det(B) = -det(A)
If you swap rows, flip the sign.
Rule 3: If B is a matrix obtained by adding a multiple of one row to another row in A, then
det(B) = det(A)
The determinant doesn't change.
Note that you can find the determinant, in most cases, with only Rule 3 (whenever no zero pivot turns up during the elimination), and in all cases with only Rules 2 and 3. Rule 1 is helpful for humans doing math on paper, trying to avoid fractions.
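If you want to convince yourself of the three rules numerically, here is a quick Python check. It uses a naive cofactor expansion as the reference determinant, which is only sensible for tiny matrices; the helper name and the test matrix are arbitrary choices for the illustration.

```python
def det_cofactor(M):
    """Determinant by cofactor expansion along the first row (tiny matrices only)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0
    for col in range(n):
        minor = [row[:col] + row[col + 1:] for row in M[1:]]
        total += (-1) ** col * M[0][col] * det_cofactor(minor)
    return total


A = [[2, 1, 3],
     [0, 4, 1],
     [1, 2, 5]]
base = det_cofactor(A)

# Multiply row 1 by a constant: the determinant is multiplied by it.
scaled = [A[0], [3 * x for x in A[1]], A[2]]
# Swap rows 0 and 2: the sign flips.
swapped = [A[2], A[1], A[0]]
# Add 2 * row 0 to row 2: the determinant is unchanged.
added = [A[0], A[1], [c + 2 * a for a, c in zip(A[0], A[2])]]

print(base, det_cofactor(scaled), det_cofactor(swapped), det_cofactor(added))
```

Integer entries keep the arithmetic exact, so the four printed values can be compared directly.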
Example
(I do unnecessary steps to demonstrate each rule more clearly)
| 2 3 3 1 |
A=| 0 4 3 -3 |
| 2 -1 -1 -3 |
| 0 -4 -3 2 |
R2 <-> R3, -α -> α (Rule 2)
| 2 3 3 1 |
-| 2 -1 -1 -3 |
| 0 4 3 -3 |
| 0 -4 -3 2 |
R2 - R1 -> R2 (Rule 3)
| 2 3 3 1 |
-| 0 -4 -4 -4 |
| 0 4 3 -3 |
| 0 -4 -3 2 |
R2/(-4) -> R2, -4α -> α (Rule 1)
| 2 3 3 1 |
4| 0 1 1 1 |
| 0 4 3 -3 |
| 0 -4 -3 2 |
R3 - 4R2 -> R3, R4 + 4R2 -> R4 (Rule 3, applied twice)
| 2 3 3 1 |
4| 0 1 1 1 |
| 0 0 -1 -7 |
| 0 0 1 6 |
R4 + R3 -> R4 (Rule 3)
| 2 3 3 1 |
4| 0 1 1 1 | = 4 ( 2 * 1 * -1 * -1 ) = 8
| 0 0 -1 -7 |
| 0 0 0 -1 |
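The same procedure can be sketched in Python, using only row swaps and row additions and tracking the sign as it goes. This is one possible way to code it, not the only one: the function name is made up, and exact fractions are used so the result is not affected by rounding.

```python
from fractions import Fraction


def det_row_reduction(M):
    """Determinant via row reduction with row swaps and row additions only."""
    A = [[Fraction(x) for x in row] for row in M]
    n = len(A)
    alpha = Fraction(1)                       # coefficient in det(A) = alpha * det(R)
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)                # no pivot available: determinant is 0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            alpha = -alpha                    # a swap flips the sign
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            # Adding a multiple of another row leaves the determinant unchanged.
            A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    result = alpha
    for i in range(n):
        result *= A[i][i]                     # product of the diagonal of R
    return result


A = [[2, 3, 3, 1],
     [0, 4, 3, -3],
     [2, -1, -1, -3],
     [0, -4, -3, 2]]
print(det_row_reduction(A))
```

On the matrix from the example this returns 8, matching the hand calculation.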
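def echelon_form(A, size):
    # Reduce the size x size matrix A (a list of row lists) to row echelon form, in place.
    for i in range(size - 1):              # column being cleared
        for j in range(size - 1, i, -1):   # rows from the bottom up to just below the diagonal
            if A[j][i] == 0:
                continue
            else:
                try:
                    req_ratio = A[j][i] / A[j - 1][i]
                    # A[j] = A[j] - req_ratio*A[j-1]
                except ZeroDivisionError:
                    # A[j], A[j-1] = A[j-1], A[j]
                    # (each swap flips the determinant's sign; that is not tracked here)
                    for x in range(size):
                        temp = A[j][x]
                        A[j][x] = A[j-1][x]
                        A[j-1][x] = temp
                    continue
                for k in range(size):
                    A[j][k] = A[j][k] - req_ratio * A[j - 1][k]
    return A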
If you did some initial research, you've probably found that with N>=4, calculating a matrix determinant becomes quite complex. Regarding algorithms, I would point you to the Wikipedia article on matrix determinants, specifically the "Algorithmic Implementation" section.
From my own experience, you can easily find an LU or QR decomposition algorithm in existing matrix libraries such as Alglib. The algorithm itself is not exactly simple, though.
I am not too familiar with LU factorization, but I know that in order to get either L or U, you need to make the initial matrix triangular (either upper triangular for U or lower triangular for L). However, once you get the matrix in triangular form for some nxn matrix A and assuming the only operation your code uses is Rb - k*Ra, you can just compute det(A) = Π T(i,i) from i=0 to n-1 (i.e. det(A) = T(0,0) x T(1,1) x ... x T(n-1,n-1)) for the triangular matrix T. Check this link to see what I'm talking about: http://matrix.reshish.com/determinant.php

Resources