Scoring System Suggestion - weighted mechanism? - algorithm

I'm trying to validate a series of words that are provided by users. I'm trying to come up with a scoring system that will determine the likelihood that the series of words are indeed valid words.
Assume the following input:
xxx yyy zzz
The first thing I do is check each word individually against a database of words that I have. Let's say that xxx is in the database, so we are 100% sure it's a valid word. Then let's say that yyy doesn't exist in the database, but a possible variation of its spelling exists (say yyyy). We don't give yyy a score of 100%, but something lower (let's say 90%). Finally, zzz doesn't exist in the database at all, so zzz gets a score of 0%.
So we have something like this:
xxx = 100%
yyy = 90%
zzz = 0%
Assume further that the users are going to either:
Provide a list of all valid words (most likely)
Provide a list of all invalid words (likely)
Provide a list of a mix of valid and invalid words (not likely)
As a whole, what is a good scoring system to determine a confidence score that xxx yyy zzz is a series of valid words? I'm not looking for anything too complex, but getting the average of the scores doesn't seem right. If some words in the list of words are valid, I think it increases the likelihood that the word not found in the database is an actual word also (it's just a limitation of the database that it doesn't contain that particular word).
NOTE: The input will generally be a minimum of 2 words (and mostly 2 words), but can be 3, 4, 5 (and maybe even more in some rare cases).

EDIT I have added a new section looking at discriminating word groups into English and non-English groups. This is below the section on estimating whether any given word is English.
I think you intuit that the scoring system you've explained here doesn't quite do justice to this problem.
It's great to find words that are in the dictionary - those words can immediately be given 100% and passed over. But what about non-matching words? How can you determine their probability?
This can be explained by a simple comparison between sentences comprising exactly the same letters:
Abergrandly recieved wuzkinds
Erbdnerye wcgluszaaindid vker
Neither sentence has any English words, but the first sentence looks English - it might be about someone (Abergrandly) who received (there was a spelling mistake) several items (wuzkinds). The second sentence is clearly just my infant hitting the keyboard.
So, in the example above, even though there is no English word present, the probability it's spoken by an English speaker is high. The second sentence has a 0% probability of being English.
I know a couple of heuristics to help detect the difference:
Simple frequency analysis of letters
In any language, some letters are more common than others. Simply counting the incidence of each letter and comparing it to the language's average tells us a lot.
There are several ways you could calculate a probability from it. One might be:
Preparation
Compute or obtain the frequencies of letters in a suitable English corpus. The NLTK is an excellent way to begin. The associated Natural Language Processing with Python book is very informative.
The Test
Count the number of occurrences of each letter in the phrase to test
Compute the linear regression where the coordinates of each letter-point are:
X axis: Its predicted frequency from the Preparation step above
Y axis: The actual count
Perform a regression analysis on the data
English should report a positive r close to 1.0. Use R^2 as a rough probability that this is English.
An r of 0 or below means either no correlation with English or a negative correlation. Not likely English.
Advantages:
Very simple to calculate
Disadvantages:
Will not work so well for small samples, e.g. "zebra, xylophone"
"Rrressseee" would seem a highly probable word
Does not discriminate between the two example sentences I gave above.
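As a rough illustration of the letter-frequency test above, here is a minimal Python sketch. The frequency table holds approximate, commonly quoted English letter percentages rather than values computed from a corpus, and the function name `letter_correlation` is my own; for real use you would derive the table from a corpus as described in the Preparation step.

```python
from collections import Counter
from math import sqrt

# Approximate English letter frequencies in percent (assumed values for
# illustration; compute your own from a corpus for real use).
ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.10, 'z': 0.07,
}

def letter_correlation(text):
    """Pearson r between expected letter frequencies and observed counts."""
    counts = Counter(ch for ch in text.lower() if ch in ENGLISH_FREQ)
    xs = [ENGLISH_FREQ[ch] for ch in ENGLISH_FREQ]
    ys = [counts[ch] for ch in ENGLISH_FREQ]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_y == 0:  # no letters at all, or perfectly uniform counts
        return 0.0
    return cov / sqrt(var_x * var_y)

english = letter_correlation("the rain in spain stays mainly in the plain")
gibberish = letter_correlation("zzzz qqqq xxxx jjjj")
print(english, gibberish)
```

The first value should come out clearly positive and the second negative. As the disadvantages above point out, short inputs make r noisy, so treat it as one signal among several.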
Bigram frequencies and Trigram frequencies
This is an extension of letter frequencies, but looks at the frequency of letter pairs or triplets. For example, a u follows a q with 99% frequency (why not 100%? dafuq). Again, the NLTK corpus is incredibly useful.
Above from: http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/digraphs.jpg
This approach is widely used across the industry, in everything from speech recognition to predictive text on your soft keyboard.
Trigraphs are especially useful. Consider that 'll' is a very common digraph. The string 'lllllllll' therefore consists of only common digraphs and the digraph approach makes it look like a word. Trigraphs resolve this because 'lll' never occurs.
The calculation of this probability of a word using trigraphs can't be done with a simple linear regression model (the vast majority of trigrams will not be present in the word and so the majority of points will be on the x axis). Instead you can use Markov chains (using a probability matrix of either bigrams or trigrams) to compute the probability of a word. An introduction to Markov chains is here.
First build a matrix of probabilities:
X axis: Every bigram ("th", "he", "in", "er", "an", etc)
Y axis: The letters of the alphabet.
The matrix members consist of the probability of the letter of the alphabet following the bigraph.
To compute probabilities from the start of a word, the X axis bigrams need to include space-a, space-b and so on up to space-z. For example, the bigram "space t" represents a word starting with t.
Computing the probability of the word consists of iterating over digraphs and obtaining the probability of the third letter given the digraph. For example, the word "they" is broken down into the following probabilities:
h following "space" t -> probability x%
e following th -> probability y%
y following he -> probability z%
Overall probability = x * y * z %
This computation solves the issues for a simple frequency analysis by highlighting the "wcgl" as having a 0% probability.
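The computation above can be sketched in a few lines of Python. This is only an illustration: the six-word training list is a toy stand-in for a real corpus, the names `train` and `log_prob` are my own, and there is no smoothing, so any unseen trigram gives a probability of exactly zero (a log-probability of minus infinity).

```python
import math
from collections import defaultdict

def train(words):
    """Count, for every two-character context, which letter follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        padded = " " + w.lower()  # leading space marks the start of a word
        for i in range(len(padded) - 2):
            counts[padded[i:i + 2]][padded[i + 2]] += 1
    return counts

def log_prob(word, counts):
    """Natural-log probability of a word under the trigram counts."""
    padded = " " + word.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        ctx, nxt = padded[i:i + 2], padded[i + 2]
        seen = counts.get(ctx)
        if not seen or nxt not in seen:
            return float("-inf")  # unseen trigram: probability zero
        total += math.log(seen[nxt] / sum(seen.values()))
    return total

# Toy corpus (assumption): a real model needs a large word list.
model = train(["they", "them", "then", "there", "the", "hey"])
print(log_prob("they", model), log_prob("wcgl", model))
```

Working in logs avoids underflow when many small probabilities are multiplied. Note that a one-letter word has no trigram at all in this scheme and would need separate handling.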
Note that the probability of any given word will be very small and becomes statistically smaller by between 10x to 20x per extra character. However, examining the probability of known English words of 3, 4, 5, 6, etc characters from a large corpus, you can determine a cutoff below which the word is highly unlikely. Each highly unlikely trigraph will drop the likelihood of being English by 1 to 2 orders of magnitude.
You might then normalize the probability of a word, for example, for 8-letter English words (I've made up the numbers below):
Probabilities from Markov chain:
Probability of the best English word = 10^-7 (10% * 10% * .. * 10%)
Cutoff (Probability of least likely English word) = 10^-14 (1% * 1% * .. * 1%)
Probability for test word (say "coattail") = 10^-12
'Normalize' results
Take logs: Best = -7; Test = -12; Cutoff = -14
Make positive: Best = 7; Test = 2; Cutoff = 0
Normalize between 1.0 and 0.0: Best = 1.0; Test = 0.29 (2/7); Cutoff = 0.0
(You can easily adjust the higher and lower bounds to, say, between 90% and 10%)
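The normalization steps above amount to a linear rescaling of the log-probability. A minimal Python sketch, with a function name of my own choosing and the made-up numbers from the example:

```python
def normalize_log_prob(log_p, log_best, log_cutoff):
    """Map a log10 probability onto [0, 1] between the cutoff and the best."""
    score = (log_p - log_cutoff) / (log_best - log_cutoff)
    return max(0.0, min(1.0, score))  # clamp anything outside the range

# Test word at 10^-12, best at 10^-7, cutoff at 10^-14 gives 2/7
print(normalize_log_prob(-12, -7, -14))
```

Words less likely than the cutoff are clamped to 0 and words more likely than the best known word are clamped to 1.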
Now we've examined how to get a better probability that any given word is English, let's look at the group of words.
The group's definition is that it's a minimum of 2 words, but can be 3, 4, 5 or (in a small number of cases) more. You don't mention that there is any overriding structure or associations between the words, so I am not assuming:
That any group is a phrase, eg "tank commander", "red letter day"
That the group is a sentence or clause, eg " I am thirsty", "Mary needs an email"
However if this assumption is wrong, then the problem becomes more tractable for larger word-groups because the words will conform to English's rules of syntax - we can use, say, the NLTK to parse the clause to gain more insight.
Looking at the probability that a group of words is English
OK, in order to get a feel of the problem, let's look at different use cases. In the following:
I am going to ignore the cases of all words or all not-words as those cases are trivial
I will consider English-like words that can't be assumed to be in a dictionary, like unusual surnames (eg Kardashian), unusual product names (eg stackexchange) and so on.
I will use simple averages of the probabilities assuming that random gibberish is 0% while English-like words are at 90%.
Two words
1. (50%) Red ajkhsdjas
2. (50%) Hkdfs Friday
3. (95%) Kardashians program
4. (95%) Using Stackexchange
From these examples, I think you would agree that 1. and 2. are likely not acceptable whereas 3. and 4. are. The simple average calculation appears a useful discriminator for two word groups.
Three words
With one suspect word:
1. (67%) Red dawn dskfa
2. (67%) Hskdkc communist manifesto
3. (67%) Economic jasdfh crisis
4. (97%) Kardashian fifteen minutes
5. (97%) stackexchange user experience
Clearly 4. and 5. are acceptable.
But what about 1., 2. or 3.? Are there any material differences between them? Probably not, which rules out using Bayesian statistics. But should these be classified as English or not? I think that's your call.
With two suspect words:
1. (33%) Red ksadjak adsfhd
2. (33%) jkdsfk dsajjds manifesto
3. (93%) Stackexchange emails Kardashians
4. (93%) Stackexchange Kardashian account
I would hazard that 1. and 2. are not acceptable, but 3. and 4 definitely are. (Well, except the Kardashians' having an account here - that does not bode well). Again the simple averages can be used as a simple discriminator - and you can choose if it's above or below 67%.
Four words
The number of permutations starts getting wild, so I'll give only a few examples:
One suspect word:
1.1 (75%) Programming jhjasd language today
1.2 (98%) Hopeless Kardashian tv series
Two suspect words:
2.1 (50%) Programming kasdhjk jhsaer today
2.2 (95%) Stackexchange implementing Kasdashian filter
Three suspect words:
3.1 (25%) Programming sdajf jkkdsf kuuerc
3.2 (93%) Stackexchange bitifying Kardashians tweetdeck
In my mind, which word groups are meaningful aligns with the simple average, with the exception of 2.1 - that's again your call.
Interestingly, the cutoff point for four-word groups might be different from that for three-word groups, so I'd recommend that your implementation has a different configuration setting for each group size. Different cutoffs are needed because the achievable averages jump in discrete steps as you go from 2 to 3 and then from 3 to 4 words, which does not mesh with the idea of smooth, continuous probabilities.
Implementing different cutoff values for these groups directly addresses your intuition "Right now, I just have a "gut" feeling that my xxx yyy zzz example really should be higher than 66.66%, but I'm not sure how to express it as a formula.".
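A minimal Python sketch of the simple average with a separate cutoff per group size. The cutoff values here are hypothetical placeholders, not recommendations; you would tune them on your own data.

```python
# Hypothetical cutoffs per group size (assumptions; tune on real data).
CUTOFFS = {2: 0.90, 3: 0.65, 4: 0.70}
DEFAULT_CUTOFF = 0.70  # assumed fallback for longer groups

def group_score(word_scores):
    """Simple average of the per-word scores (each between 0 and 1)."""
    return sum(word_scores) / len(word_scores)

def is_acceptable(word_scores):
    cutoff = CUTOFFS.get(len(word_scores), DEFAULT_CUTOFF)
    return group_score(word_scores) >= cutoff

print(is_acceptable([0.9, 1.0]))       # "Kardashians program"
print(is_acceptable([1.0, 0.0]))       # "Red ajkhsdjas"
print(is_acceptable([1.0, 0.0, 1.0]))  # "Economic jasdfh crisis"
```

With these placeholder cutoffs a three-word group with one gibberish word passes, which is the "your call" case discussed above; raising the three-word cutoff above 0.67 would reject it.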
Five words
You get the idea - I'm not going to enumerate any more here. However, as you get to five words, it starts to get enough structure that several new heuristics can come in:
Use of Bayesian probabilities/statistics (what is the probability of the third word being a word given that the first two were?)
Parsing the group using the NLTK and looking at whether it makes grammatical sense
Problem cases
English has a number of very short words, and this might cause a problem. For example:
Gibberish: r xu r
Is this English? I am a
You may have to write code to specifically test for 1 and 2 letter words.
TL;DR Summary
Non-dictionary words can be tested for how 'English' (or French or Spanish, etc) they are using letter and trigram frequencies. Picking up English-like words and attributing them a high score is critical to distinguish English groups
Up to four words, a simple average has great discriminatory power, but you probably want to set a different cutoff for 2 words, 3 words and 4 words.
Five words and above you can probably start using Bayesian statistics
Longer word groups if they should be sentences or sentence fragments can be tested using a natural language tool, such as NLTK.
This is a heuristic process and, ultimately, there will be confounding values (such as "I am a"). Writing a perfect statistical analysis routine may therefore not be especially useful compared to a simple average if it can be confounded by a large number of exceptions.

Perhaps you could use Bayes' formula.
You already have numerical guesses for the probability of each word to be real.
Next step is to make educated guesses about the probability of the entire list being good, bad or mixed (i.e., turn "most likely", "likely" and "not likely" into numbers.)

I'll give a Bayesian hierarchical model solution. It has a few parameters that must be set by hand, but it is quite robust regarding these parameters, as the simulation below shows. And it can handle not only the scoring system for the word list, but also a probable classification of the user who entered the words. The treatment may be a little technical, but in the end we'll have a routine to calculate the scores as a function of 3 numbers: the number of words in the list, the number of those with an exact match in the database, and the number of those with a partial match (as in yyyy). The routine is implemented in R, but if you have never used it, just download the interpreter, copy and paste the code into its console, and you'll see the results shown here.
BTW English is not my first language, so bear with me... :-)
1. Model Specification:
There are 3 classes of users, named I, II, III. We assume that each word list is generated by a single user, and that the user is drawn randomly from a universe of users. We say that this universe is 70% class I, 25% class II and 5% class III. These numbers can be changed, of course. We have so far
Prob[User=I] = 70%
Prob[User=II] = 25%
Prob[User=III] = 5%
Given the user, we assume conditional independence, i.e., the user will not look to previous words to decide if he'll type in a valid or invalid word.
User I tends to give only valid words, User II only invalid words, and user III is mixed. So we set
Prob[Word=OK | User=I] = 99%
Prob[Word=OK | User=II] = 0.1%
Prob[Word=OK | User=III] = 50%
The probabilities of the word being invalid, given the class of the user, are complementary. Note that we give a very small but non-zero probability of a class-II user entering valid words, since even a monkey in front of a typewriter will eventually type a valid word.
The final step of the model specification regards the database. We assume that, for each word, the query may have 3 outcomes: a total match, a partial match (as in yyyy ) or no match. In probability terms, we assume that
Prob[match | valid] = 98% (not all valid words will be found)
Prob[partial | valid] = 0.2% (a rare event)
Prob[match | INvalid] = 0 (the database may be incomplete, but it has no invalid words)
Prob[partial | INvalid] = 0.1% (a rare event)
The probabilities of not finding the word don't have to be set, as they are complementary. That's it, our model is set.
2. Notation and Objective
We have a discrete random variable U, taking values in {1, 2, 3} and two discrete random vectors W and F, each of size n (= the number of words), where W_i is 1 if the word is valid and 2 if the word is invalid, and F_i is 1 if the word is found in the database, 2 if it's a partial match and 3 if it's not found.
Only vector F is observable, the others are latent. Using Bayes theorem and the distributions we set up in the model specification, we can calculate
(a) Prob[User=I | F],
i. e., the posterior probability of the user being in class I, given the observed matchings; and
(b) Prob[W=all valid | F],
i. e., the posterior probability that all words are valid, given the observed matchings.
Depending on your objective, you can use one or another as a scoring solution. If you are interested in distinguishing a real user from a computer program, for instance, you can use (a). If you only care about the word list being valid, you should use (b).
I'll try to briefly explain the theory in the next section, but this is the usual setup in the context of Bayesian hierarchical models. The reference is Gelman (2004), "Bayesian Data Analysis".
If you want, you can jump to section 4, with the code.
3. The Math
I'll use a slight abuse of notation, as usual in this context, writing
p(x|y) for Prob[X=x|Y=y] and p(x,y) for Prob[X=x,Y=y].
The goal (a) is to calculate p(u|f), for u=1. Using Bayes theorem:
p(u|f) = p(u,f)/p(f) = p(f|u)p(u)/p(f).
p(u) is given. p(f|u) is obtained from:
p(f|u) = \prod_{i=1}^{n} \sum_{w_i=1}^{2} p(f_i|w_i) p(w_i|u)
= \prod_{i=1}^{n} p(f_i|u)
= p(f_i=1|u)^{m} p(f_i=2|u)^{p} p(f_i=3|u)^{n-m-p}
where m = number of matchings and p = number of partial matchings.
p(f) is calculated as:
\sum_{u=1}^{3} p(f|u)p(u)
All these can be calculated directly.
Goal (b) is given by
p(w|f) = p(f|w)*p(w)/p(f)
where
p(f|w) = \prod_{i=1}^{n} p(f_i|w_i)
and p(f_i|w_i) is given in the model specification.
p(f) was calculated above, so we need only
p(w) = \sum_{u=1}^{3} p(w|u)p(u)
where
p(w|u) = \prod_{i=1}^{n} p(w_i|u)
So everything is set for implementation.
4. The Code
The code is written as an R script. The constants are set at the beginning, in accordance with what was discussed above, and the output is given by the functions
(a) p.u_f(u, n, m, p)
and
(b) p.wOK_f(n, m, p)
that calculate the probabilities for options (a) and (b), given inputs:
u = desired user class (set to u=1)
n = number of words
m = number of matchings
p = number of partial matchings
The code itself:
### Constants:
# User:
# Prob[U=1], Prob[U=2], Prob[U=3]
Prob_user = c(0.70, 0.25, 0.05)
# Words:
# Prob[Wi=OK|U=1,2,3]
Prob_OK = c(0.99, 0.001, 0.5)
Prob_NotOK = 1 - Prob_OK
# Database:
# Prob[Fi=match|Wi=OK], Prob[Fi=match|Wi=NotOK]:
Prob_match = c(0.98, 0)
# Prob[Fi=partial|Wi=OK], Prob[Fi=partial|Wi=NotOK]:
Prob_partial = c(0.002, 0.001)
# Prob[Fi=NOmatch|Wi=OK], Prob[Fi=NOmatch|Wi=NotOK]:
Prob_NOmatch = 1 - Prob_match - Prob_partial
###### First Goal: Probability of being a user type I, given the numbers of matchings (m) and partial matchings (p).
# Prob[Fi=fi|U=u]
#
p.fi_u <- function(fi, u)
{
    unname(rbind(Prob_match, Prob_partial, Prob_NOmatch) %*% rbind(Prob_OK, Prob_NotOK))[fi,u]
}
# Prob[F=f|U=u]
#
p.f_u <- function(n, m, p, u)
{
    exp( log(p.fi_u(1, u))*m + log(p.fi_u(2, u))*p + log(p.fi_u(3, u))*(n-m-p) )
}
# Prob[F=f]
#
p.f <- function(n, m, p)
{
    p.f_u(n, m, p, 1)*Prob_user[1] + p.f_u(n, m, p, 2)*Prob_user[2] + p.f_u(n, m, p, 3)*Prob_user[3]
}
# Prob[U=u|F=f]
#
p.u_f <- function(u, n, m, p)
{
    p.f_u(n, m, p, u) * Prob_user[u] / p.f(n, m, p)
}
# Probability user type I for n=1,...,5:
for(n in 1:5) for(m in 0:n) for(p in 0:(n-m))
{
    cat("n =", n, "| m =", m, "| p =", p, "| Prob type I =", p.u_f(1, n, m, p), "\n")
}
##################################################################################################
# Second Goal: Probability all words OK given matchings/partial matchings.
# Prob[F=f|W=all OK]
p.f_wOK <- function(n, m, p)
{
    exp( log(Prob_match[1])*m + log(Prob_partial[1])*p + log(Prob_NOmatch[1])*(n-m-p) )
}
# Prob[W=all OK]
p.wOK <- function(n)
{
    sum(exp( log(Prob_OK)*n + log(Prob_user) ))
}
# Prob[W=all OK|F=f]
p.wOK_f <- function(n, m, p)
{
    p.f_wOK(n, m, p)*p.wOK(n)/p.f(n, m, p)
}
# Probability all words ok for n=1,...,5:
for(n in 1:5) for(m in 0:n) for(p in 0:(n-m))
{
    cat("n =", n, "| m =", m, "| p =", p, "| Prob all OK =", p.wOK_f(n, m, p), "\n")
}
5. Results
These are the results for n=1,...,5, and all possibilities for m and p. For instance, if you have 3 words, one match, one partial match, and one not found, you can be 66.5% sure it's a class-I user. In the same situation, you can attribute a score of 42.8% that all words are valid.
Note that option (a) does not give 100% score to the case of all matches, but option (b) does. This is expected, since we assumed that the database has no invalid words, hence if they are all found, then they are all valid. OTOH, there is a small chance that a user in class II or III can enter all valid words, but this chance decreases rapidly as n increases.
(a)
n = 1 | m = 0 | p = 0 | Prob type I = 0.06612505
n = 1 | m = 0 | p = 1 | Prob type I = 0.8107086
n = 1 | m = 1 | p = 0 | Prob type I = 0.9648451
n = 2 | m = 0 | p = 0 | Prob type I = 0.002062543
n = 2 | m = 0 | p = 1 | Prob type I = 0.1186027
n = 2 | m = 0 | p = 2 | Prob type I = 0.884213
n = 2 | m = 1 | p = 0 | Prob type I = 0.597882
n = 2 | m = 1 | p = 1 | Prob type I = 0.9733557
n = 2 | m = 2 | p = 0 | Prob type I = 0.982106
n = 3 | m = 0 | p = 0 | Prob type I = 5.901733e-05
n = 3 | m = 0 | p = 1 | Prob type I = 0.003994149
n = 3 | m = 0 | p = 2 | Prob type I = 0.200601
n = 3 | m = 0 | p = 3 | Prob type I = 0.9293284
n = 3 | m = 1 | p = 0 | Prob type I = 0.07393334
n = 3 | m = 1 | p = 1 | Prob type I = 0.665019
n = 3 | m = 1 | p = 2 | Prob type I = 0.9798274
n = 3 | m = 2 | p = 0 | Prob type I = 0.7500993
n = 3 | m = 2 | p = 1 | Prob type I = 0.9864524
n = 3 | m = 3 | p = 0 | Prob type I = 0.990882
n = 4 | m = 0 | p = 0 | Prob type I = 1.66568e-06
n = 4 | m = 0 | p = 1 | Prob type I = 0.0001158324
n = 4 | m = 0 | p = 2 | Prob type I = 0.007636577
n = 4 | m = 0 | p = 3 | Prob type I = 0.3134207
n = 4 | m = 0 | p = 4 | Prob type I = 0.9560934
n = 4 | m = 1 | p = 0 | Prob type I = 0.004198015
n = 4 | m = 1 | p = 1 | Prob type I = 0.09685249
n = 4 | m = 1 | p = 2 | Prob type I = 0.7256616
n = 4 | m = 1 | p = 3 | Prob type I = 0.9847408
n = 4 | m = 2 | p = 0 | Prob type I = 0.1410053
n = 4 | m = 2 | p = 1 | Prob type I = 0.7992839
n = 4 | m = 2 | p = 2 | Prob type I = 0.9897541
n = 4 | m = 3 | p = 0 | Prob type I = 0.855978
n = 4 | m = 3 | p = 1 | Prob type I = 0.9931117
n = 4 | m = 4 | p = 0 | Prob type I = 0.9953741
n = 5 | m = 0 | p = 0 | Prob type I = 4.671933e-08
n = 5 | m = 0 | p = 1 | Prob type I = 3.289577e-06
n = 5 | m = 0 | p = 2 | Prob type I = 0.0002259559
n = 5 | m = 0 | p = 3 | Prob type I = 0.01433312
n = 5 | m = 0 | p = 4 | Prob type I = 0.4459982
n = 5 | m = 0 | p = 5 | Prob type I = 0.9719289
n = 5 | m = 1 | p = 0 | Prob type I = 0.0002158996
n = 5 | m = 1 | p = 1 | Prob type I = 0.005694145
n = 5 | m = 1 | p = 2 | Prob type I = 0.1254661
n = 5 | m = 1 | p = 3 | Prob type I = 0.7787294
n = 5 | m = 1 | p = 4 | Prob type I = 0.988466
n = 5 | m = 2 | p = 0 | Prob type I = 0.00889696
n = 5 | m = 2 | p = 1 | Prob type I = 0.1788336
n = 5 | m = 2 | p = 2 | Prob type I = 0.8408416
n = 5 | m = 2 | p = 3 | Prob type I = 0.9922575
n = 5 | m = 3 | p = 0 | Prob type I = 0.2453087
n = 5 | m = 3 | p = 1 | Prob type I = 0.8874493
n = 5 | m = 3 | p = 2 | Prob type I = 0.994799
n = 5 | m = 4 | p = 0 | Prob type I = 0.9216786
n = 5 | m = 4 | p = 1 | Prob type I = 0.9965092
n = 5 | m = 5 | p = 0 | Prob type I = 0.9976583
(b)
n = 1 | m = 0 | p = 0 | Prob all OK = 0.04391523
n = 1 | m = 0 | p = 1 | Prob all OK = 0.836025
n = 1 | m = 1 | p = 0 | Prob all OK = 1
n = 2 | m = 0 | p = 0 | Prob all OK = 0.0008622994
n = 2 | m = 0 | p = 1 | Prob all OK = 0.07699368
n = 2 | m = 0 | p = 2 | Prob all OK = 0.8912977
n = 2 | m = 1 | p = 0 | Prob all OK = 0.3900892
n = 2 | m = 1 | p = 1 | Prob all OK = 0.9861099
n = 2 | m = 2 | p = 0 | Prob all OK = 1
n = 3 | m = 0 | p = 0 | Prob all OK = 1.567032e-05
n = 3 | m = 0 | p = 1 | Prob all OK = 0.001646751
n = 3 | m = 0 | p = 2 | Prob all OK = 0.1284228
n = 3 | m = 0 | p = 3 | Prob all OK = 0.923812
n = 3 | m = 1 | p = 0 | Prob all OK = 0.03063598
n = 3 | m = 1 | p = 1 | Prob all OK = 0.4278888
n = 3 | m = 1 | p = 2 | Prob all OK = 0.9789305
n = 3 | m = 2 | p = 0 | Prob all OK = 0.485069
n = 3 | m = 2 | p = 1 | Prob all OK = 0.990527
n = 3 | m = 3 | p = 0 | Prob all OK = 1
n = 4 | m = 0 | p = 0 | Prob all OK = 2.821188e-07
n = 4 | m = 0 | p = 1 | Prob all OK = 3.046322e-05
n = 4 | m = 0 | p = 2 | Prob all OK = 0.003118531
n = 4 | m = 0 | p = 3 | Prob all OK = 0.1987396
n = 4 | m = 0 | p = 4 | Prob all OK = 0.9413746
n = 4 | m = 1 | p = 0 | Prob all OK = 0.001109629
n = 4 | m = 1 | p = 1 | Prob all OK = 0.03975118
n = 4 | m = 1 | p = 2 | Prob all OK = 0.4624648
n = 4 | m = 1 | p = 3 | Prob all OK = 0.9744778
n = 4 | m = 2 | p = 0 | Prob all OK = 0.05816511
n = 4 | m = 2 | p = 1 | Prob all OK = 0.5119571
n = 4 | m = 2 | p = 2 | Prob all OK = 0.9843855
n = 4 | m = 3 | p = 0 | Prob all OK = 0.5510398
n = 4 | m = 3 | p = 1 | Prob all OK = 0.9927134
n = 4 | m = 4 | p = 0 | Prob all OK = 1
n = 5 | m = 0 | p = 0 | Prob all OK = 5.05881e-09
n = 5 | m = 0 | p = 1 | Prob all OK = 5.530918e-07
n = 5 | m = 0 | p = 2 | Prob all OK = 5.899106e-05
n = 5 | m = 0 | p = 3 | Prob all OK = 0.005810434
n = 5 | m = 0 | p = 4 | Prob all OK = 0.2807414
n = 5 | m = 0 | p = 5 | Prob all OK = 0.9499773
n = 5 | m = 1 | p = 0 | Prob all OK = 3.648353e-05
n = 5 | m = 1 | p = 1 | Prob all OK = 0.001494098
n = 5 | m = 1 | p = 2 | Prob all OK = 0.051119
n = 5 | m = 1 | p = 3 | Prob all OK = 0.4926606
n = 5 | m = 1 | p = 4 | Prob all OK = 0.9710204
n = 5 | m = 2 | p = 0 | Prob all OK = 0.002346281
n = 5 | m = 2 | p = 1 | Prob all OK = 0.07323064
n = 5 | m = 2 | p = 2 | Prob all OK = 0.5346423
n = 5 | m = 2 | p = 3 | Prob all OK = 0.9796679
n = 5 | m = 3 | p = 0 | Prob all OK = 0.1009589
n = 5 | m = 3 | p = 1 | Prob all OK = 0.5671273
n = 5 | m = 3 | p = 2 | Prob all OK = 0.9871377
n = 5 | m = 4 | p = 0 | Prob all OK = 0.5919764
n = 5 | m = 4 | p = 1 | Prob all OK = 0.9938288
n = 5 | m = 5 | p = 0 | Prob all OK = 1
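For readers who don't use R, here is a minimal Python sketch of the same formulas from section 3, using the same constants as the R script. The function names are my own, and user classes and match outcomes are zero-indexed here (0 = class I, and 0 = match, 1 = partial, 2 = not found).

```python
PROB_USER = [0.70, 0.25, 0.05]   # P(U = I), P(U = II), P(U = III)
PROB_OK = [0.99, 0.001, 0.5]     # P(word valid | user class)
PROB_MATCH = [0.98, 0.0]         # P(match | valid), P(match | invalid)
PROB_PARTIAL = [0.002, 0.001]    # P(partial | valid), P(partial | invalid)
PROB_NOMATCH = [1 - a - b for a, b in zip(PROB_MATCH, PROB_PARTIAL)]

def p_fi_u(fi, u):
    """P(F_i = fi | U = u), summing over the word being valid or invalid."""
    given_valid, given_invalid = (PROB_MATCH, PROB_PARTIAL, PROB_NOMATCH)[fi]
    return given_valid * PROB_OK[u] + given_invalid * (1 - PROB_OK[u])

def p_f_u(n, m, p, u):
    """P(F = f | U = u) for m matches, p partial matches, n - m - p misses."""
    return p_fi_u(0, u) ** m * p_fi_u(1, u) ** p * p_fi_u(2, u) ** (n - m - p)

def p_f(n, m, p):
    return sum(p_f_u(n, m, p, u) * PROB_USER[u] for u in range(3))

def p_u_f(u, n, m, p):
    """Goal (a): posterior probability of user class u."""
    return p_f_u(n, m, p, u) * PROB_USER[u] / p_f(n, m, p)

def p_all_ok_f(n, m, p):
    """Goal (b): posterior probability that all n words are valid."""
    p_f_given_ok = (PROB_MATCH[0] ** m * PROB_PARTIAL[0] ** p
                    * PROB_NOMATCH[0] ** (n - m - p))
    p_all_ok = sum(ok ** n * pu for ok, pu in zip(PROB_OK, PROB_USER))
    return p_f_given_ok * p_all_ok / p_f(n, m, p)

print(p_u_f(0, 3, 1, 1), p_all_ok_f(3, 1, 1))
```

The printed pair corresponds to the n = 3, m = 1, p = 1 rows of tables (a) and (b) above.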

If "average" is no solution because the database lacks words, I'd say: extend the database :)
Another idea could be to 'weigh' the results to get a slightly adjusted average, for example:
100% = 1.00x weight
90% = 0.95x weight
80% = 0.90x weight
...
0% = 0.50x weight
so for your example you would:
(100*1 + 90*0.95 + 0*0.5) / (100*1 + 100*0.95 + 100*0.5) = 0.75714285714
=> 75.7%
regular average would be 63.3%
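The weighting above can be sketched in Python as follows. The table only lists a few points, so the linear rule `0.5 + score / 200` is my assumption about how to fill in the "..." rows; it reproduces the listed weights for 100%, 90%, 80% and 0%.

```python
def weight(score_pct):
    # Assumed linear interpolation of the table: 100% -> 1.0, 0% -> 0.5
    return 0.5 + score_pct / 200.0

def weighted_score(scores_pct):
    """Weighted score as in the worked example, returned as a fraction."""
    num = sum(s * weight(s) for s in scores_pct)
    den = sum(100 * weight(s) for s in scores_pct)
    return num / den

print(weighted_score([100, 90, 0]))
```

Because low-scoring words carry less weight, a single unknown word drags the total down less than it does in the plain average.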

Since the order of words is not important in your description, the independent variable is the fraction of valid words. If the fraction is a perfect 1, i.e. all words are found to be perfect matches with the DB, then you are perfectly sure to have the all-valid outcome. If it's zero, i.e. all words are perfect misses in the DB, then you are perfectly sure to have the all-invalid outcome. If you have .5, then this must be the unlikely mixed-up outcome because neither of the other two is possible.
You say the mixed outcome is unlikely while the two extremes are more likely. You are after the likelihood of the all-valid outcome.
Let the fraction of valid words (sum of "surenesses" of matches / # of words) be f and hence the desired likelihood of the all-valid outcome be L(f). By the discussion so far, we know L(1)=1 and L(f)=0 for 0<=f<=1/2 .
To honor your information that the mixed outcome is less likely than the all-valid (and the all-invalid) outcome, the shape of L must rise monotonically and quickly from 1/2 toward 1 and reach 1 at f=1.
Since this is heuristic, we might pick any reasonable function with this character. If we're clever it will have a parameter to control the steepness of the step and perhaps another for its location. This lets us tweak what "less likely" means for the middle case.
One such function is this for 1/2 <= f <= 1:
L(f) = 5 + f * (-24 + (36 - 16 * f) * f) + (-4 + f * (16 + f * (-20 + 8 * f))) * s
and zero for 0 <= f < 1/2. Although it's hairy-looking, it's the simplest polynomial that passes through (1/2,0) and (1,1) with slope 0 at f=1 and an adjustable slope at f=1/2 controlled by s (the slope with respect to f there is 2s).
You can set 0 <= s <= 3 to change the step shape. Here is a plot with s=3, which is probably what you want:
If you set s > 3, it shoots above 1 before settling down, not what we want.
Of course there are infinitely many other possibilities. If this one doesn't work, comment and we'll look for another.
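For reference, the polynomial can be written directly as a small Python function. The function name is my own; the coefficients are copied from the formula above.

```python
def all_valid_likelihood(f, s=3.0):
    """Heuristic likelihood L(f) of the all-valid outcome, for 0 <= f <= 1."""
    if f < 0.5:
        return 0.0
    base = 5 + f * (-24 + (36 - 16 * f) * f)
    shape = -4 + f * (16 + f * (-20 + 8 * f))
    return base + shape * s

# The question's example: scores 100%, 90% and 0% give f = 1.9 / 3
print(all_valid_likelihood(1.9 / 3))
```

Both bracketed terms vanish at f=1/2 and the second one also vanishes at f=1, which is why s changes the steepness without moving the end points.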

Averaging is, of course, rubbish. If the individual word probabilities were accurate, the probability that all words are correct is simply the product, not the average. If you have an estimate for the uncertainties in your individual probabilities, you could work out their product marginalized over all the individual probabilities.
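A minimal Python sketch of the product rule, assuming the per-word probabilities are independent (that assumption is what makes the plain product valid). The function name is my own.

```python
import math

def all_valid_probability(word_probs):
    """Probability that every word is valid, assuming independence."""
    return math.prod(word_probs)

print(all_valid_probability([1.0, 0.9, 0.0]))
```

Note that a single 0% word forces the whole list to 0, so this only makes sense if a miss in the database is given a small non-zero probability rather than exactly zero.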


How to fill integers into a symmetric matrix in a balanced way

Suppose I have a symmetric matrix A, that is, A(i,j)=A(j,i).
The value of A(i,j) can be either i or j.
How can I fill values into A so that the number of occurrences of each value is as close to equal as possible (i.e. as balanced as possible)? Is there an algorithm that can handle this?
Example A:
A = 1 1 1 1
1 2 2 2
1 2 3 3
1 2 3 4
1 occurs 7 times
2 occurs 5 times
3 occurs 3 times
4 occurs 1 time
Example B:
A = 1 2 1 1
2 2 3 2
1 3 3 4
1 2 4 4
1 occurs 5 times
2 occurs 5 times
3 occurs 3 times
4 occurs 3 times
In example B the counts are (5,5,3,3), which are closer together than in example A (7,5,3,1).
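To compare candidate fills, a small Python helper can check the constraints and count how often each value occurs. The function names are my own and the matrix is Example B from above.

```python
from collections import Counter

def is_valid_fill(matrix):
    """Check symmetry and that entry (i, j) holds i or j (1-based values)."""
    n = len(matrix)
    return all(
        matrix[i][j] == matrix[j][i] and matrix[i][j] in (i + 1, j + 1)
        for i in range(n) for j in range(n)
    )

def value_counts(matrix):
    """Number of occurrences of each value in the matrix."""
    return Counter(v for row in matrix for v in row)

example_b = [
    [1, 2, 1, 1],
    [2, 2, 3, 2],
    [1, 3, 3, 4],
    [1, 2, 4, 4],
]
print(is_valid_fill(example_b), sorted(value_counts(example_b).items()))
```

The spread between the largest and smallest count is one simple way to measure how balanced a fill is.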
I am looking for a solution for an nxn matrix.
Extension
What if the matrix is sparse, that is, some elements cannot be filled in? Which algorithm can be used to handle this case?
Thanks for your time.
Found one solution, but without a real algorithm...
1 2 3 1 1
2 2 3 4 2
3 3 3 4 5
1 4 4 4 5
1 2 5 5 5
Basically: 25/5=5, looked for how to fill with 5 of each 1-5.
for 5 - reversed L from corner,
then up and left one spot for 4s,
and for 3s.
got "creative" for 2s and 1s...
I guess it's kind of algorithm...
Here is a solution written in python based on Weighted Bipartite Matching (or the isomorphic Minimum Cost Flow problem.)
#!/usr/bin/env python3
"""
filename: mcf_matrix_assign.py
purpose: demonstrate the use of weighted bipartite matching (isomorphic to MCF
         with a suitable transform) to solve a matrix assignment problem with
         certain conditions and optimization goals.
"""
import networkx as nx

N = 5
K = N  # ensure K is large enough to satisfy flow, N <= K <= N*N
       # setting K larger simply means a longer runtime

G = nx.DiGraph()
total_demand = 0
for i in range(N*N):
    # assert a row-major linear indexing of the matrix
    row, col = i // N, i % N  # integer division (Python 3)
    if row >= col:
        continue  # diagonal and lower triangle are fixed by symmetry
    total_demand += 1
    G.add_node('s'+str(i), demand=-1)
    G.add_edge('s'+str(i), 'v'+str(row), weight=0, capacity=1)
    G.add_edge('s'+str(i), 'v'+str(col), weight=0, capacity=1)
G.add_node('sink', demand=total_demand)

# attach each 'value' to the sink with incrementally larger weight
for i in range(N):
    for j in range(K):
        dummy_node = 'v'+str(i)+'w'+str(j)
        G.add_edge('v'+str(i), dummy_node, weight=j, capacity=1)
        G.add_edge(dummy_node, 'sink', weight=0, capacity=1)

flow_dict = nx.min_cost_flow(G)

# decode the solution to get the matrix assignment reported by the MCF (or
# equivalently weighted bipartite matching)
solution = [-1] * (N*N)
for i in range(N*N):
    # assert a row-major linear indexing of the matrix
    row, col = i // N, i % N
    if row == col:
        solution[i] = row
        continue  # diagonal entries are fixed
    if row > col:
        solution[i] = solution[col*N+row]
        continue  # lower triangle mirrors the upper triangle
    adjacency = flow_dict['s'+str(i)]
    solution[i] = row if adjacency['v'+str(row)] == 1 else col

# print the solution
for row in range(N):
    print('-' * (4*N+1))
    print('| ' + ' | '.join(str(solution[row*N+col]+1) for col in range(N)) + ' |')
print('-' * (4*N+1))

print('Histogram summary:')
counts = [(i+1, sum(1 for s in solution if s == i)) for i in range(N)]
for value, count in counts:
    print(' Value', value, 'appears', count, 'times.')
This produces the solution:
---------------------
| 1 | 1 | 3 | 1 | 5 |
---------------------
| 1 | 2 | 2 | 4 | 2 |
---------------------
| 3 | 2 | 3 | 4 | 3 |
---------------------
| 1 | 4 | 4 | 4 | 5 |
---------------------
| 5 | 2 | 3 | 5 | 5 |
---------------------
Histogram summary:
Value 1 appears 5 times.
Value 2 appears 5 times.
Value 3 appears 5 times.
Value 4 appears 5 times.
Value 5 appears 5 times.
And here is the solution when N=4 in the script.
-----------------
| 1 | 2 | 1 | 4 |
-----------------
| 2 | 2 | 3 | 4 |
-----------------
| 1 | 3 | 3 | 3 |
-----------------
| 4 | 4 | 3 | 4 |
-----------------
Histogram summary:
Value 1 appears 3 times.
Value 2 appears 3 times.
Value 3 appears 5 times.
Value 4 appears 5 times.
It's fairly easy to prove that this will always find an optimal answer in polynomial time.
Explanation
It is probably easiest to explain what is happening by describing the graph construction for a small case. For this discussion, fix N=3.
In this case we have a matrix assignment with variables
X s0 s1
X X s2
X X X
where X denotes a fixed value and sk denotes the kth slot in the array to fill.
In this case we also have 3 available value assignments [1,2,3] for each of the slots sk. (This is where it is easy to make modifications to the "allowed" values for any sk.)
If we construct a bipartite graph between the slots sk and the value assignments v1,v2,v3 in a way that edges of capacity 1 and weight zero are used to connect sk to each legal vi assignment, we can then solve it easily using MCF.
For illustration, the appropriate graph for N=3 is shown below:
Once the minimum cost flow is computed, we can decode the assignment by checking which edges are used in the solution.
A note on performance
networkx was used here in Python purely out of convenience; it is by no means efficient. The quality of implementation of the MCF algorithm in networkx is quite low and I would not recommend trying to scale it up.
For serious applications, I would instead recommend the LEMON MCF library (in particular, its cost-scaling algorithm is competitive), or Andrew Goldberg's implementation of cost-scaling (which is hard to find but exists), which is probably quite efficient as well.
There is a special pattern to follow in order to get the best possible result. For each column (of row 1), start filling the matrix diagonally with values 1, 2, ..., n, fixing the corresponding symmetric slot. At the end, you will have the best possible result.
#include <iostream>
using namespace std;
int main(){
int n = 4; //size of matrix
int values[n]; for(int i = 0; i < n; i++) values[i] = 0;
int matrix[n][n]; for(int i = 0; i < n; i++) for(int j = 0; j < n; j++) matrix[i][j] = -1;
for(int c = 0; c < n; c++){
int i = 0, j = c;
for(int x = 0; x < n; x++){
if(matrix[i][j] != -1) {
break;
}
matrix[i][j] = matrix[j][i] = x;
i = (i + 1) % n;
j = (j + 1) % n;
}
}
for(int i = 0; i < n; i++){
for(int j = 0; j < n; j++){
cout<<matrix[i][j] + 1<<" ";
values[matrix[i][j]]++;
}
cout<<endl;
}
cout<<endl;
for (int i = 0; i < n; i++) {
cout<<(i + 1)<<" appears "<<values[i]<<" times"<<endl;
}
return 0;
}
OUTPUT
1 1 1 4
1 2 2 2
1 2 3 3
4 2 3 4
1 appears 5 times
2 appears 5 times
3 appears 3 times
4 appears 3 times
You can test it here.
The complexity is O(n²), since you have to fill all the matrix.
When n is odd, the solution is always n occurrences for each number, but when n is even, this is impossible.
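As a rough cross-check of the odd/even claim, the diagonal fill can be ported to Python and the occurrences counted. This is only a sketch of the same loop as the C++ program above; the helper names `diagonal_fill` and `value_counts` are made up for illustration.

```python
from collections import Counter

def diagonal_fill(n):
    """Fill an n x n symmetric matrix along wrapped diagonals (0-based values)."""
    matrix = [[-1] * n for _ in range(n)]
    for c in range(n):
        i, j = 0, c
        for x in range(n):
            if matrix[i][j] != -1:
                break  # this diagonal was already filled as a mirror image
            matrix[i][j] = matrix[j][i] = x
            i, j = (i + 1) % n, (j + 1) % n
    return matrix

def value_counts(matrix):
    """How often each value 1..n appears in the matrix."""
    counts = Counter(v + 1 for row in matrix for v in row)
    return [counts[v] for v in range(1, len(matrix) + 1)]

for n in (4, 5):
    m = diagonal_fill(n)
    # The result must stay symmetric.
    assert all(m[i][j] == m[j][i] for i in range(n) for j in range(n))
    print(n, value_counts(m))
# prints:
# 4 [5, 5, 3, 3]
# 5 [5, 5, 5, 5, 5]
```

For n = 4 this reproduces the histogram shown in the OUTPUT above, and for odd n every value appears exactly n times.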

Finding the largest power of a number that divides a factorial in haskell

So I am writing a haskell program to calculate the largest power of a number that divides a factorial.
largestPower :: Int -> Int -> Int
Here largestPower a b has to find the largest power of b that divides a!.
Now I understand the math behind it, the way to find the answer is to repeatedly divide a (just a) by b, ignore the remainder and finally add all the quotients. So if we have something like
largestPower 10 2
we should get 8 because 10/2=5/2=2/2=1 and we add 5+2+1=8
However, I am unable to figure out how to implement this as a function: do I use arrays or just a simple recursive function?
I am gravitating towards it being just a normal function, though I guess it can be done by storing quotients in an array and adding them.
Recursion without an accumulator
You can simply write a recursive algorithm and sum up the result of each call. Here we have two cases:
a is less than b, in which case the largest power is 0. So:
largestPower a b | a < b = 0
a is greater than or equal to b, in that case we divide a by b, calculate largestPower for that division, and add the division to the result. Like:
| otherwise = d + largestPower d b
where d = (div a b)
Or putting it together:
largestPower a b | a < b = 0
                 | otherwise = d + largestPower d b
    where d = (div a b)
Recursion with an accumulator
You can also use recursion with an accumulator: a variable you pass through the recursion, and update accordingly. At the end, you return that accumulator (or a function called on that accumulator).
Here the accumulator would of course be the running sum of divisions, so:
largestPower = largestPower' 0
So we will define a function largestPower' (mind the accent) with an accumulator as first argument that is initialized as 0.
Now in the recursion, there are two cases:
a is less than b, we simply return the accumulator:
largestPower' r a b | a < b = r
otherwise we add the division to our accumulator, and pass the division to largestPower' with a recursive call:
| otherwise = largestPower' (d+r) d b
where d = (div a b)
Or the full version:
largestPower = largestPower' 0
largestPower' r a b | a < b = r
                    | otherwise = largestPower' (d+r) d b
    where d = (div a b)
Naive correct algorithm
The algorithm is not correct. A "naive" algorithm would be to simply divide every item and keep decrementing until you reach 1, like:
largestPower 1 _ = 0
largestPower a b = sumPower a + largestPower (a-1) b
where sumPower n | n `mod` b == 0 = 1 + sumPower (div n b)
| otherwise = 0
So this means that for the largestPower 4 2, this can be written as:
largestPower 4 2 = sumPower 4 + sumPower 3 + sumPower 2
and:
sumPower 4 = 1 + sumPower 2
= 1 + 1 + sumPower 1
= 1 + 1 + 0
= 2
sumPower 3 = 0
sumPower 2 = 1 + sumPower 1
= 1 + 0
= 1
So 3.
The algorithm as stated can be implemented quite simply:
largestPower :: Int -> Int -> Int
largestPower 0 b = 0
largestPower a b = d + largestPower d b where d = a `div` b
However, the algorithm is not correct for composite b. For example, largestPower 10 6 with this algorithm yields 1, but in fact the correct answer is 4. The problem is that this algorithm ignores multiples of 2 and 3 that are not multiples of 6. How you fix the algorithm is a completely separate question, though.
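One possible way to handle composite b, sketched here in Python rather than Haskell, is to factor b into primes, apply the repeated-division rule to each prime, and take the minimum of the per-prime exponents divided by their multiplicity in b. The names `legendre`, `prime_factorization` and `largest_power` are illustrative, and the sketch assumes b >= 2.

```python
def legendre(a, p):
    """Exponent of the prime p in a!, by repeated integer division."""
    total = 0
    while a >= p:
        a //= p
        total += a
    return total

def prime_factorization(b):
    """Map each prime factor of b to its exponent (trial division)."""
    factors = {}
    p = 2
    while p * p <= b:
        while b % p == 0:
            factors[p] = factors.get(p, 0) + 1
            b //= p
        p += 1
    if b > 1:
        factors[b] = factors.get(b, 0) + 1
    return factors

def largest_power(a, b):
    """Largest k such that b**k divides a!, assuming b >= 2."""
    return min(legendre(a, p) // e for p, e in prime_factorization(b).items())

print(largest_power(10, 2))  # 8
print(largest_power(10, 6))  # 4
```

For b = 6 the limiting prime is 3: 10! contains 2^8 but only 3^4, so 6^4 is the largest power that divides it.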

Kernel density estimation julia

I am trying to implement a kernel density estimation. However my code does not provide the answer it should. It is also written in julia but the code should be self explanatory.
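Here is the algorithm:
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$$
where
$$K(u) = \frac{1}{2}\,\mathbf{1}\{|u| \le 1\}$$
So the algorithm tests whether the distance between x and an observation X_i, weighted by some constant factor (the binwidth), is less than one. If so, it assigns 0.5 / (n * h) to that value, where n = # of observations.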
Here is my implementation:
#Kernel density function.
#Purpose: estimate the probability density function (pdf)
#of given observations
##param data: observations for which the pdf should be estimated
##return: returns an array with the estimated densities
function kernelDensity(data)
|
| #Uniform kernel function.
| ##param x: Current x value
| ##param X_i: x value of observation i
| ##param width: binwidth
| ##return: Returns 1 if the absolute distance from
| #x(current) to x(observation) weighted by the binwidth
| #is less than 1. Else it returns 0.
|
| function uniformKernel(x, observation, width)
| | u = ( x - observation ) / width
| | abs ( u ) <= 1 ? 1 : 0
| end
|
| #number of observations in the data set
| n = length(data)
|
| #binwidth (set arbitrarily to 0.1)
| h = 0.1
|
| #vector that stored the pdf
| res = zeros( Real, n )
|
| #counter variable for the loop
| counter = 0
|
| #lower and upper limit of the x axis
| start = floor(minimum(data))
| stop = ceil (maximum(data))
|
| #main loop
| ##linspace: divides the space from start to stop in n
| #equally spaced intervals
| for x in linspace(start, stop, n)
| | counter += 1
| | for observation in data
| | |
| | | #count all observations for which the kernel
| | | #returns 1 and mult by 0.5 because the
| | | #kernel computed the absolute difference which can be
| | | #either positive or negative
| | | res[counter] += 0.5 * uniformKernel(x, observation, h)
| | end
| | #divide by n times h
| | res[counter] /= n * h
| end
| #return results
| res
end
#run function
##rand: generates 10 uniform random numbers between 0 and 1
kernelDensity(rand(10))
and this is being returned:
> 0.0
> 1.5
> 2.5
> 1.0
> 1.5
> 1.0
> 0.0
> 0.5
> 0.5
> 0.0
the sum of which is: 8.5 (The cumulative distribution function. Should be 1.)
So there are two bugs:
The values are not properly scaled. Each number should be around one tenth of its current value. In fact, if the number of observations increases by a factor of 10^k (k = 1, 2, ...), then the cdf also increases by a factor of 10^k.
For example:
> kernelDensity(rand(1000))
> 953.53
They don't sum up to 10 (or one if it were not for the scaling error). The error becomes more evident as the sample size increases: there are approx. 5% of the observations not being included.
I believe that I implemented the formula 1:1, hence I really don't understand where the error is.
I'm not an expert on KDEs, so take all of this with a grain of salt, but a very similar (but much faster!) implementation of your code would be:
function kernelDensity{T<:AbstractFloat}(data::Vector{T}, h::T)
  n = length(data)      # number of observations
  res = zeros(T, n)     # accumulators have to start at zero
  lb = minimum(data); ub = maximum(data)
  for (i,x) in enumerate(linspace(lb, ub, n))
    for obs in data
      res[i] += abs((obs-x)/h) <= 1. ? 0.5 : 0.
    end
    res[i] /= (n*h)
  end
  sum(res)
end
If I'm not mistaken, the density estimate should integrate to 1, that is, we would expect kernelDensity(rand(100), 0.1)/100 to get at least close to 1. In the implementation above I'm getting there, give or take 5%, but then again we don't know that 0.1 is the optimal bandwidth (using h=0.135 instead I'm getting there to within 0.1%), and the uniform kernel is known to only be about 93% "efficient".
In any case, there's a very good Kernel Density package in Julia available here, so you probably should just do Pkg.add("KernelDensity") instead of trying to code your own Epanechnikov kernel :)
To point out the mistake: you have n bins B_i of size 2h covering [0,1], so a random point X lands in an expected 2nh bins. You divide by 2nh.
For n points, the expected value of your function is n.
Actually, you have some bins of size < 2h (for example, if start = 0, half of the first bin is outside of [0,1]); factoring this in gives the bias.
Edit: Btw, the bias is easy to calculate if you assume that the bins have random locations in [0,1]. Then the bins are on average missing h/2 = 5% of their size.
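The scaling point can be illustrated with a small Python sketch (not the Julia code from the question). It evaluates the same uniform-kernel estimate on a fine grid and compares the raw sum with the sum multiplied by the grid spacing. The sample values, the grid, and the name `uniform_kde` are assumptions chosen for illustration; the grid is deliberately extended past the data so that no bin is cut off at the edges.

```python
def uniform_kde(data, grid, h):
    """Uniform-kernel density estimate at each grid point."""
    n = len(data)
    return [
        sum(0.5 for obs in data if abs((x - obs) / h) <= 1) / (n * h)
        for x in grid
    ]

data = [0.05, 0.2, 0.35, 0.5, 0.62, 0.8, 0.95]  # hypothetical sample in [0, 1]
h = 0.1                                          # binwidth, as in the question
dx = 0.001                                       # grid spacing
grid = [-0.2 + k * dx for k in range(1401)]      # covers [-0.2, 1.2]

density = uniform_kde(data, grid, h)

raw_sum = sum(density)        # depends on how many grid points are used
integral = sum(density) * dx  # Riemann sum; this is what should be near 1

print(raw_sum, integral)
```

The raw sum grows with the number of evaluation points, whereas the sum times the grid spacing stays close to 1, which is the sense in which a density "sums to one".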

How to find the maximum number of matching data?

Given a two-dimensional array such as:
-------------------------
|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | X | X | O | O | X |
|---|---|---|---|---|---|
| 2 | O | O | O | X | X |
|---|---|---|---|---|---|
| 3 | X | X | O | X | X |
|---|---|---|---|---|---|
| 4 | X | X | O | X | X |
-------------------------
I have to find the largest set of cells currently containing O with a maximum of one cell per row and one per column.
For instance, in the previous example, the optimal answer is 3, when:
row 1 goes with column 4;
row 2 goes with column 1 (or 2);
row 3 (or 4) goes with column 3.
It seems that I have to find an algorithm in O(CR) (where C is the number of columns and R the number of rows).
My first idea was to sort the rows in ascending order according to the number of compatible cells (O) each one has. Here is what the algorithm would look like:
For i From 0 To R
For j From 0 To N
If compatible(i, j)
add(a[j], i)
Sort a according to a[j].size
result = 0
For i From 0 To N
For j From 0 to a[i].size
if used[a[i][j]] = false
used[a[i][j]] = true
result = result + 1
break
Print result
Although I didn't find any counterexample, I don't know whether it always gives the optimal answer.
Is this algorithm correct? Is there any better solution?
Going off Billiska's suggestion, I found a nice implementation of the "Hopcroft-Karp" algorithm in Python here:
http://code.activestate.com/recipes/123641-hopcroft-karp-bipartite-matching/
This algorithm is one of several that solve the maximum bipartite matching problem. Using that code exactly "as-is", here's how I solved the example problem in your post (in Python):
from collections import defaultdict
X=0; O=1;
patterns = [ [ X , X , O , O , X ],
[ O , O , O , X , X ],
[ X , X , O , X , X ],
[ X , X , O , X , X ]]
G = defaultdict(list)
for i, x in enumerate(patterns):
    # iterate over the cells of the current row (not over the rows again),
    # so that non-square grids are handled correctly
    for j, y in enumerate(x):
        if y:
            G['Row '+str(i)].append('Col '+str(j))
solution = bipartiteMatch(G) ### function defined in provided link
print len(solution[0]), solution[0]
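For readers who do not want to pull in the linked recipe, a simpler augmenting-path matching (often attributed to Kuhn) is enough for grids of this size. The following is only a sketch in Python 3, not the Hopcroft-Karp code from the link; the function name `max_matching` and the string encoding of the grid are assumptions made for illustration.

```python
def max_matching(grid):
    """Maximum number of 'O' cells with at most one per row and per column."""
    rows, cols = len(grid), len(grid[0])
    match_col = [-1] * cols  # match_col[c] = row currently assigned to column c

    def try_assign(r, seen):
        # Try to give row r a column, re-routing earlier rows if necessary.
        for c in range(cols):
            if grid[r][c] == "O" and not seen[c]:
                seen[c] = True
                if match_col[c] == -1 or try_assign(match_col[c], seen):
                    match_col[c] = r
                    return True
        return False

    size = sum(try_assign(r, [False] * cols) for r in range(rows))
    return size, match_col

grid = ["XXOOX",
        "OOOXX",
        "XXOXX",
        "XXOXX"]
size, match_col = max_matching(grid)
print(size)  # 3
```

Each row triggers at most one search over the grid, so the running time is bounded by the number of rows times the number of cells, which is more than the O(CR) mentioned in the question but fine for small inputs.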

nᵗʰ ugly number

Numbers whose only prime factors are 2, 3, or 5 are called ugly numbers.
Example:
1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, ...
1 can be considered as 2^0.
I am working on finding nth ugly number. Note that these numbers are extremely sparsely distributed as n gets large.
I wrote a trivial program that computes if a given number is ugly or not. For n > 500 - it became super slow. I tried using memoization - observation: ugly_number * 2, ugly_number * 3, ugly_number * 5 are all ugly. Even with that it is slow. I tried using some properties of log - since that will reduce this problem from multiplication to addition - but, not much luck yet. Thought of sharing this with you all. Any interesting ideas?
Using a concept similar to Sieve of Eratosthenes (thanks Anon)
for (int i(2), uglyCount(0); ; i++) {
if (i % 2 == 0)
continue;
if (i % 3 == 0)
continue;
if (i % 5 == 0)
continue;
uglyCount++;
if (uglyCount == n - 1)
break;
}
i is the nth ugly number.
Even this is pretty slow. I am trying to find the 1500th ugly number.
A simple, fast solution in Java. It uses the approach described by Anon.
Here TreeSet is just a container capable of returning smallest element in it. (No duplicates stored.)
int n = 20;
SortedSet<Long> next = new TreeSet<Long>();
next.add((long) 1);
long cur = 0;
for (int i = 0; i < n; ++i) {
cur = next.first();
System.out.println("number " + (i + 1) + ": " + cur);
next.add(cur * 2);
next.add(cur * 3);
next.add(cur * 5);
next.remove(cur);
}
Since 1000th ugly number is 51200000, storing them in bool[] isn't really an option.
edit
As a recreation from work (debugging stupid Hibernate), here's a completely linear solution. Thanks to marcog for the idea!
int n = 1000;
int last2 = 0;
int last3 = 0;
int last5 = 0;
long[] result = new long[n];
result[0] = 1;
for (int i = 1; i < n; ++i) {
long prev = result[i - 1];
while (result[last2] * 2 <= prev) {
++last2;
}
while (result[last3] * 3 <= prev) {
++last3;
}
while (result[last5] * 5 <= prev) {
++last5;
}
long candidate1 = result[last2] * 2;
long candidate2 = result[last3] * 3;
long candidate3 = result[last5] * 5;
result[i] = Math.min(candidate1, Math.min(candidate2, candidate3));
}
System.out.println(result[n - 1]);
The idea is that to calculate a[i], we can use a[j]*2 for some j < i. But we also need to make sure that 1) a[j]*2 > a[i - 1] and 2) j is smallest possible.
Then, a[i] = min(a[j]*2, a[k]*3, a[t]*5).
I am working on finding nth ugly number. Note that these numbers are extremely sparsely distributed as n gets large.
I wrote a trivial program that computes if a given number is ugly or not.
This looks like the wrong approach for the problem you're trying to solve - it's a bit of a shlemiel algorithm.
Are you familiar with the Sieve of Eratosthenes algorithm for finding primes? Something similar (exploiting the knowledge that every ugly number is 2, 3 or 5 times another ugly number) would probably work better for solving this.
With the comparison to the Sieve I don't mean "keep an array of bools and eliminate possibilities as you go up". I am more referring to the general method of generating solutions based on previous results. Where the Sieve gets a number and then removes all multiples of it from the candidate set, a good algorithm for this problem would start with an empty set and then add the correct multiples of each ugly number to that.
My answer refers to the correct answer given by Nikita Rybak.
So that one could see a transition from the idea of the first approach to that of the second.
from collections import deque
def hamming():
    h=1;next2,next3,next5=deque([]),deque([]),deque([])
    while True:
        yield h
        next2.append(2*h)
        next3.append(3*h)
        next5.append(5*h)
        h=min(next2[0],next3[0],next5[0])
        if h == next2[0]: next2.popleft()
        if h == next3[0]: next3.popleft()
        if h == next5[0]: next5.popleft()
What's changed from Nikita Rybak's 1st approach is that, instead of adding the next candidates into a single data structure, i.e. a tree set, one can add each of them separately into 3 FIFO lists. This way, each list is kept sorted all the time, and the next least candidate must always be at the head of one or more of these lists.
If we eliminate the use of the three lists above, we arrive at the second implementation in Nikita Rybak's answer. This is done by evaluating those candidates (to be contained in three lists) only when needed, so that there is no need to store them.
Simply put:
In the first approach, we put every new candidate into single data structure, and that's bad because too many things get mixed up unwisely. This poor strategy inevitably entails O(log(tree size)) time complexity every time we make a query to the structure. By putting them into separate queues, however, you will see that each query takes only O(1) and that's why the overall performance reduces to O(n)!!! This is because each of the three lists is already sorted, by itself.
I believe you can solve this problem in sub-linear time, probably O(n^{2/3}).
To give you the idea, if you simplify the problem to allow factors of just 2 and 3, you can achieve O(n^{1/2}) time starting by searching for the smallest power of two that is at least as large as the nth ugly number, and then generating a list of O(n^{1/2}) candidates. This code should give you an idea how to do it. It relies on the fact that the nth number containing only powers of 2 and 3 has a prime factorization whose sum of exponents is O(n^{1/2}).
def foo(n):
p2 = 1 # current power of 2
p3 = 1 # current power of 3
e3 = 0 # exponent of current power of 3
t = 1 # number less than or equal to the current power of 2
while t < n:
p2 *= 2
if p3 * 3 < p2:
p3 *= 3
e3 += 1
t += 1 + e3
candidates = [p2]
c = p2
for i in range(e3):
c /= 2
c *= 3
if c > p2: c /= 2
candidates.append(c)
return sorted(candidates)[n - (t - len(candidates))]
The same idea should work for three allowed factors, but the code gets more complex. The sum of the powers of the factorization drops to O(n^{1/3}), but you need to consider more candidates, O(n^{2/3}) to be more precise.
A lot of good answers here, but I was having trouble understanding those, specifically how any of these answers, including the accepted one, maintained the axiom 2 in Dijkstra's original paper:
Axiom 2. If x is in the sequence, so is 2 * x, 3 * x, and 5 * x.
After some whiteboarding, it became clear that the axiom 2 is not an invariant at each iteration of the algorithm, but actually the goal of the algorithm itself. At each iteration, we try to restore the condition in axiom 2. If last is the last value in the result sequence S, axiom 2 can simply be rephrased as:
For some x in S, the next value in S is the minimum of 2x,
3x, and 5x, that is greater than last. Let's call this axiom 2'.
Thus, if we can find x, we can compute the minimum of 2x, 3x, and 5x in constant time, and add it to S.
But how do we find x? One approach is, we don't; instead, whenever we add a new element e to S, we compute 2e, 3e, and 5e, and add them to a minimum priority queue. Since this operation guarantees e is in S, simply extracting the top element of the PQ satisfies axiom 2'.
This approach works, but the problem is that we generate a bunch of numbers we may not end up using. See this answer for an example; if the user wants the 5th element in S (5), the PQ at that moment holds 6 6 8 9 10 10 12 15 15 20 25. Can we not waste this space?
Turns out, we can do better. Instead of storing all these numbers, we simply maintain three counters for each of the multiples, namely, 2i, 3j, and 5k. These are candidates for the next number in S. When we pick one of them, we increment only the corresponding counter, and not the other two. By doing so, we are not eagerly generating all the multiples, thus solving the space problem with the first approach.
Let's see a dry run for n = 8, i.e. the number 9. We start with 1, as stated by axiom 1 in Dijkstra's paper.
+---------+---+---+---+----+----+----+-------------------+
| # | i | j | k | 2i | 3j | 5k | S |
+---------+---+---+---+----+----+----+-------------------+
| initial | 1 | 1 | 1 | 2 | 3 | 5 | {1} |
+---------+---+---+---+----+----+----+-------------------+
| 1 | 1 | 1 | 1 | 2 | 3 | 5 | {1,2} |
+---------+---+---+---+----+----+----+-------------------+
| 2 | 2 | 1 | 1 | 4 | 3 | 5 | {1,2,3} |
+---------+---+---+---+----+----+----+-------------------+
| 3 | 2 | 2 | 1 | 4 | 6 | 5 | {1,2,3,4} |
+---------+---+---+---+----+----+----+-------------------+
| 4 | 3 | 2 | 1 | 6 | 6 | 5 | {1,2,3,4,5} |
+---------+---+---+---+----+----+----+-------------------+
| 5 | 3 | 2 | 2 | 6 | 6 | 10 | {1,2,3,4,5,6} |
+---------+---+---+---+----+----+----+-------------------+
| 6 | 4 | 2 | 2 | 8 | 6 | 10 | {1,2,3,4,5,6} |
+---------+---+---+---+----+----+----+-------------------+
| 7 | 4 | 3 | 2 | 8 | 9 | 10 | {1,2,3,4,5,6,8} |
+---------+---+---+---+----+----+----+-------------------+
| 8 | 5 | 3 | 2 | 10 | 9 | 10 | {1,2,3,4,5,6,8,9} |
+---------+---+---+---+----+----+----+-------------------+
Notice that S didn't grow at iteration 6, because the minimum candidate 6 had already been added previously. To avoid this problem of having to remember all of the previous elements, we amend our algorithm to increment all the counters whenever the corresponding multiples are equal to the minimum candidate. That brings us to the following Scala implementation.
import scala.annotation.tailrec

def hamming(n: Int): Seq[BigInt] = {
@tailrec
def next(x: Int, factor: Int, xs: IndexedSeq[BigInt]): Int = {
val leq = factor * xs(x) <= xs.last
if (leq) next(x + 1, factor, xs)
else x
}
@tailrec
def loop(i: Int, j: Int, k: Int, xs: IndexedSeq[BigInt]): IndexedSeq[BigInt] = {
if (xs.size < n) {
val a = next(i, 2, xs)
val b = next(j, 3, xs)
val c = next(k, 5, xs)
val m = Seq(2 * xs(a), 3 * xs(b), 5 * xs(c)).min
val x = a + (if (2 * xs(a) == m) 1 else 0)
val y = b + (if (3 * xs(b) == m) 1 else 0)
val z = c + (if (5 * xs(c) == m) 1 else 0)
loop(x, y, z, xs :+ m)
} else xs
}
loop(0, 0, 0, IndexedSeq(BigInt(1)))
}
Basically, the search could be made O(n):
Consider that you keep a partial history of ugly numbers. Now, at each step you have to find the next one. It should be equal to a number from the history multiplied by 2, 3 or 5. Choose the smallest of them, add it to the history, and drop some numbers from it so that the smallest from the list multiplied by 5 would be larger than the largest.
It will be fast, because the search of the next number will be simple:
min(largest * 2, smallest * 5, one from the middle * 3),
that is larger than the largest number in the list. If they are sparse, the list will always contain few numbers, so the search for the number that has to be multiplied by 3 will be fast.
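The history-window idea can be sketched in a few lines of Python. This is only one reading of the description above, with the hypothetical name `ugly_window`; for simplicity it scans the whole window at every step instead of locating the three candidates directly, so it is not strictly O(n).

```python
from collections import deque

def ugly_window(n):
    """Return the n-th ugly number (1-based) using a sliding history window."""
    history = deque([1])
    largest = 1
    for _ in range(n - 1):
        # Smallest multiple of a remembered number that exceeds the largest so far.
        nxt = min(h * f for h in history for f in (2, 3, 5) if h * f > largest)
        history.append(nxt)
        largest = nxt
        # A number whose 5-fold is not above the largest can never produce
        # a new candidate again, so it can be dropped from the front.
        while history[0] * 5 <= largest:
            history.popleft()
    return largest

print([ugly_window(k) for k in range(1, 12)])
# [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15]
```

The window never becomes empty, because the newest element always satisfies 5 * x > x, and its double is always a valid candidate for the next step.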
Here is a correct solution in ML. The function ugly() will return a stream (lazy list) of hamming numbers. The function nth can be used on this stream.
This uses the Sieve method, the next elements are only calculated when needed.
datatype stream = Item of int * (unit->stream);
fun cons (x,xs) = Item(x, xs);
fun head (Item(i,xf)) = i;
fun tail (Item(i,xf)) = xf();
fun maps f xs = cons(f (head xs), fn()=> maps f (tail xs));
fun nth(s,1)=head(s)
| nth(s,n)=nth(tail(s),n-1);
fun merge(xs,ys)=if (head xs=head ys) then
cons(head xs,fn()=>merge(tail xs,tail ys))
else if (head xs<head ys) then
cons(head xs,fn()=>merge(tail xs,ys))
else
cons(head ys,fn()=>merge(xs,tail ys));
fun double n=n*2;
fun triple n=n*3;
fun ij()=
cons(1,fn()=>
merge(maps double (ij()),maps triple (ij())));
fun quint n=n*5;
fun ugly()=
cons(1,fn()=>
merge((tail (ij())),maps quint (ugly())));
This was first year CS work :-)
To find the n-th ugly number in O (n^(2/3)), jonderry's algorithm will work just fine. Note that the numbers involved are huge so any algorithm trying to check whether a number is ugly or not has no chance.
Finding all of the n smallest ugly numbers in ascending order is done easily by using a priority queue in O (n log n) time and O (n) space: Create a priority queue of numbers with the smallest numbers first, initially including just the number 1. Then repeat n times: Remove the smallest number x from the priority queue. If x hasn't been removed before, then x is the next larger ugly number, and we add 2x, 3x and 5x to the priority queue. (If anyone doesn't know the term priority queue, it's like the heap in the heapsort algorithm). Here's the start of the algorithm:
1 -> 2 3 5
1 2 -> 3 4 5 6 10
1 2 3 -> 4 5 6 6 9 10 15
1 2 3 4 -> 5 6 6 8 9 10 12 15 20
1 2 3 4 5 -> 6 6 8 9 10 10 12 15 15 20 25
1 2 3 4 5 6 -> 6 8 9 10 10 12 12 15 15 18 20 25 30
1 2 3 4 5 6 -> 8 9 10 10 12 12 15 15 18 20 25 30
1 2 3 4 5 6 8 -> 9 10 10 12 12 15 15 16 18 20 24 25 30 40
Proof of execution time: We extract an ugly number from the queue n times. We initially have one element in the queue, and after extracting an ugly number we add three elements, increasing the number by 2. So after n ugly numbers are found we have at most 2n + 1 elements in the queue. Extracting an element can be done in logarithmic time. We extract more numbers than just the ugly numbers but at most n ugly numbers plus 2n - 1 other numbers (those that could have been in the sieve after n-1 steps). So the total time is less than 3n item removals in logarithmic time = O (n log n), and the total space is at most 2n + 1 elements = O (n).
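The priority-queue procedure described above translates almost directly into Python with the standard `heapq` module. This is a minimal sketch; the function name `ugly_numbers` is made up, and duplicates are skipped by comparing against the last number found, as in the trace.

```python
import heapq

def ugly_numbers(n):
    """Return the n smallest ugly numbers in ascending order."""
    heap = [1]
    found = []
    while len(found) < n:
        x = heapq.heappop(heap)
        if found and x == found[-1]:
            continue  # x was already removed once; skip the duplicate
        found.append(x)
        for factor in (2, 3, 5):
            heapq.heappush(heap, x * factor)
    return found

print(ugly_numbers(8))  # [1, 2, 3, 4, 5, 6, 8, 9]
```

Because the heap pops values in non-decreasing order, any duplicate of x comes out immediately after x, so comparing with the last accepted value is sufficient.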
I guess we can use Dynamic Programming (DP) and compute nth Ugly Number. Complete explanation can be found at http://www.geeksforgeeks.org/ugly-numbers/
#include <iostream>
#define MAX 1000
using namespace std;
// Find Minimum among three numbers
long int min(long int x, long int y, long int z) {
if(x<=y) {
if(x<=z) {
return x;
} else {
return z;
}
} else {
if(y<=z) {
return y;
} else {
return z;
}
}
}
// Actual Method that computes all Ugly Numbers till the required range
long int uglyNumber(int count) {
long int arr[MAX], val;
// index of last multiple of 2 --> i2
// index of last multiple of 3 --> i3
// index of last multiple of 5 --> i5
int i2, i3, i5, lastIndex;
arr[0] = 1;
i2 = i3 = i5 = 0;
lastIndex = 1;
while(lastIndex<=count-1) {
val = min(2*arr[i2], 3*arr[i3], 5*arr[i5]);
arr[lastIndex] = val;
lastIndex++;
if(val == 2*arr[i2]) {
i2++;
}
if(val == 3*arr[i3]) {
i3++;
}
if(val == 5*arr[i5]) {
i5++;
}
}
return arr[lastIndex-1];
}
// Starting point of program
int main() {
long int num;
int count;
cout<<"Which Ugly Number : ";
cin>>count;
num = uglyNumber(count);
cout<<endl<<num;
return 0;
}
We can see that it's quite fast; just change the value of MAX to compute higher ugly numbers.
Using 3 generators in parallel and selecting the smallest at each iteration, here is a C program to compute all ugly numbers below 2^128 in less than 1 second:
#include <limits.h>
#include <stdio.h>
#if 0
typedef unsigned long long ugly_t;
#define UGLY_MAX (~(ugly_t)0)
#else
typedef __uint128_t ugly_t;
#define UGLY_MAX (~(ugly_t)0)
#endif
int print_ugly(int i, ugly_t u) {
char buf[64], *p = buf + sizeof(buf);
*--p = '\0';
do { *--p = '0' + u % 10; } while ((u /= 10) != 0);
return printf("%d: %s\n", i, p);
}
int main() {
int i = 0, n2 = 0, n3 = 0, n5 = 0;
ugly_t u, ug2 = 1, ug3 = 1, ug5 = 1;
#define UGLY_COUNT 110000
ugly_t ugly[UGLY_COUNT];
while (i < UGLY_COUNT) {
u = ug2;
if (u > ug3) u = ug3;
if (u > ug5) u = ug5;
if (u == UGLY_MAX)
break;
ugly[i++] = u;
print_ugly(i, u);
if (u == ug2) {
if (ugly[n2] <= UGLY_MAX / 2)
ug2 = 2 * ugly[n2++];
else
ug2 = UGLY_MAX;
}
if (u == ug3) {
if (ugly[n3] <= UGLY_MAX / 3)
ug3 = 3 * ugly[n3++];
else
ug3 = UGLY_MAX;
}
if (u == ug5) {
if (ugly[n5] <= UGLY_MAX / 5)
ug5 = 5 * ugly[n5++];
else
ug5 = UGLY_MAX;
}
}
return 0;
}
Here are the last 10 lines of output:
100517: 338915443777200000000000000000000000000
100518: 339129266201729628114355465608000000000
100519: 339186548067800934969350553600000000000
100520: 339298130282929870605468750000000000000
100521: 339467078447341918945312500000000000000
100522: 339569540691046437734055936000000000000
100523: 339738624000000000000000000000000000000
100524: 339952965770562084651663360000000000000
100525: 340010386766614455386112000000000000000
100526: 340122240000000000000000000000000000000
Here is a version in Javascript usable with QuickJS:
import * as std from "std";
function main() {
var i = 0, n2 = 0, n3 = 0, n5 = 0;
var u, ug2 = 1n, ug3 = 1n, ug5 = 1n;
var ugly = [];
for (;;) {
u = ug2;
if (u > ug3) u = ug3;
if (u > ug5) u = ug5;
ugly[i++] = u;
std.printf("%d: %s\n", i, String(u));
if (u >= 0x100000000000000000000000000000000n)
break;
if (u == ug2)
ug2 = 2n * ugly[n2++];
if (u == ug3)
ug3 = 3n * ugly[n3++];
if (u == ug5)
ug5 = 5n * ugly[n5++];
}
return 0;
}
main();
Here is my code. The idea is to divide the number by 2 (as long as the remainder is 0), then by 3 and by 5. If the number ends up as 1, it is an ugly number.
You can count and even print all ugly numbers up to n.
int count = 0;
for (int i = 2; i <= n; i++) {
int temp = i;
while (temp % 2 == 0) temp=temp / 2;
while (temp % 3 == 0) temp=temp / 3;
while (temp % 5 == 0) temp=temp / 5;
if (temp == 1) {
cout << i << endl;
count++;
}
}
This problem can be done in O(1).
If we remove 1 and look at numbers between 2 through 30, we will notice that there are 22 numbers.
Now, for any number x in the 22 numbers above, there will be a number x + 30 in between 31 and 60 that is also ugly. Thus, we can find at least 22 numbers between 31 and 60. Now for every ugly number between 31 and 60, we can write it as s + 30. So s will be ugly too, since s + 30 is divisible by 2, 3, or 5. Thus, there will be exactly 22 numbers between 31 and 60. This logic can be repeated for every block of 30 numbers after that.
Thus, there will be 23 numbers in the first 30 numbers, and 22 for every 30 after that. That is, the first 23 uglies will occur between 1 and 30, 45 uglies will occur between 1 and 60, 67 uglies will occur between 1 and 90, etc.
Now, if I am given n, say 137, I can see that 137/22 = 6.22. The answer will lie between 6*30 and 7*30, or between 180 and 210. By 180, I will have the 6*22 + 1 = 133rd ugly number at 180. I will have the 155th ugly number at 210. So I am looking for the 4th ugly number (since 137 = 133 + 4) in the interval [2, 30], which is 5. The 137th ugly number is then 180 + 5 = 185.
Another example: if I want the 1500th ugly number, I count 1500/22 = 68 blocks. Thus, I will have 22*68 + 1 = 1497th ugly at 30*68 = 2040. The next three uglies in the [2, 30] block are 2, 3, and 4. So our required ugly is at 2040 + 4 = 2044.
The point is that I can simply build a list of ugly numbers in [2, 30] and find the answer by doing lookups in O(1).
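The block-of-30 counting treats a number as ugly when it is divisible by 2, 3, or 5. A quick check against the definition at the top of the thread (the only prime factors are 2, 3, or 5) can be sketched as follows; `is_ugly` is a hypothetical helper written for this comparison.

```python
def is_ugly(x):
    """True if x has no prime factors other than 2, 3 and 5."""
    for p in (2, 3, 5):
        while x % p == 0:
            x //= p
    return x == 1

# Numbers in [2, 30] divisible by 2, 3, or 5 (the count used in the argument).
divisible = [x for x in range(2, 31) if x % 2 == 0 or x % 3 == 0 or x % 5 == 0]
# Numbers in [2, 30] that satisfy the stricter prime-factor definition.
strict = [x for x in range(2, 31) if is_ugly(x)]

print(len(divisible), len(strict))  # 22 17
print(is_ugly(185))                 # False, since 185 = 5 * 37
```

Under the stricter definition the two sets differ (14 = 2 * 7 is divisible by 2 but has the prime factor 7), so the periodic argument applies only to the "divisible by 2, 3, or 5" reading.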
Here is another O(n) approach (Python solution) based on the idea of merging three sorted lists. The challenge is to find the next ugly number in increasing order. For example, we know the first seven ugly numbers are [1,2,3,4,5,6,8]. The ugly numbers are actually from the following three lists:
list 1: 1*2, 2*2, 3*2, 4*2, 5*2, 6*2, 8*2 ... ( multiply each ugly number by 2 )
list 2: 1*3, 2*3, 3*3, 4*3, 5*3, 6*3, 8*3 ... ( multiply each ugly number by 3 )
list 3: 1*5, 2*5, 3*5, 4*5, 5*5, 6*5, 8*5 ... ( multiply each ugly number by 5 )
So the nth ugly number is the nth number of the list merged from the three lists above:
1, 1*2, 1*3, 2*2, 1*5, 2*3 ...
def nthuglynumber(n):
    p2, p3, p5 = 0,0,0
    uglynumber = [1]
    while len(uglynumber) < n:
        ugly2, ugly3, ugly5 = uglynumber[p2]*2, uglynumber[p3]*3, uglynumber[p5]*5
        next = min(ugly2, ugly3, ugly5)
        if next == ugly2: p2 += 1 # multiply each number
        if next == ugly3: p3 += 1 # only once by each
        if next == ugly5: p5 += 1 # of the three factors
        uglynumber += [next]
    return uglynumber[-1]
STEP I: computing three next possible ugly numbers from the three lists
ugly2, ugly3, ugly5 = uglynumber[p2]*2, uglynumber[p3]*3, uglynumber[p5]*5
STEP II, find the one next ugly number as the smallest of the three above:
next = min(ugly2, ugly3, ugly5)
STEP III: moving the pointer forward if its ugly number was the next ugly number
if next == ugly2: p2+=1
if next == ugly3: p3+=1
if next == ugly5: p5+=1
note: not using if with elif nor else
STEP IV: adding the next ugly number into the merged list uglynumber
uglynumber += [next]

Resources