algorithm to find closest string using same characters - algorithm

Given a list L of n character strings, and an input character string S, what is an efficient way to find the character string in L that contains the most characters that exist in S? We want to find the string in L that is most closely made up of the letters contained in S.
The obvious answer is to loop through all n strings and check how many characters in the current string exist in S. However, this algorithm will be run frequently, and the list L of n strings will be stored in a database... looping manually through all n strings would require something like O(n*m^2), where n is the number of strings in L and m is the max length of any string in L, as well as the max length of S... in this case m is actually a constant of 150.
Is there a better way than just a simple loop? Is there a data structure I can load the n strings into that would give me fast search ability? Is there an algorithm that uses the pre-calculated meta-data about each of the n strings that would perform better than a loop?
I know there are a lot of geeks out there that are into the algorithms. So please help!
Thanks!

If you are after substrings, a Trie or Patricia trie might be a good starting point.
If you don't care about the order, just about the number of each symbol or letter, I would calculate the histogram of all strings and then compare them with the histogram of the input.
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Hello World => ...11..1...3..2..1....1...
This lowers the per-query cost to O(m + 26·n) - building the histogram of S once and comparing it against the n precomputed histograms - plus a one-time preprocessing pass over L, if you consider only case-insensitive Latin letters.
If m is constant, you could interpret the histogram as a 26 dimensional vector on a 26 dimensional unit sphere by normalizing it. Then you could just calculate the Dot Product of two vectors yielding the cosine of the angle between the two vectors, and this value should be proportional to the similarity of the strings.
Assuming m = 3, an alphabet A = { 'U', 'V', 'W' } of size three only, and the following list of strings.
L = { "UUU", "UVW", "WUU" }
The histograms are the following.
H = { (3, 0, 0), (1, 1, 1), (2, 0, 1) }
A histogram h = (x, y, z) is normalized to h' = (x/r, y/r, z/r) with r the Euclidean norm of the histogram h - that is r = sqrt(x² + y² + z²).
H' = { (1.000, 0.000, 0.000), (0.577, 0.577, 0.577), (0.894, 0.000, 0.447) }
The input S = "VVW" has the histogram hs = (0, 2, 1) and the normalized histogram hs' = (0.000, 0.894, 0.447).
Now we can compare two histograms h1 = (a, b, c) and h2 = (x, y, z) using the Euclidean distance between them.
d(h1, h2) = sqrt((a - x)² + (b - y)² + (c - z)²)
For the example we obtain.
d((3, 0, 0), (0, 2, 1)) = 3.742
d((1, 1, 1), (0, 2, 1)) = 1.414
d((2, 0, 1), (0, 2, 1)) = 2.828
Hence "UVW" is closest to "VVW" (smaller numbers indicate higher similarity).
Using the normalized histograms h1' = (a', b', c') and h2' = (x', y', z') we can instead calculate the similarity as the dot product of both histograms.
d'(h1', h2') = a'x' + b'y' + c'z'
For the example we obtain.
d'((1.000, 0.000, 0.000), (0.000, 0.894, 0.447)) = 0.000
d'((0.577, 0.577, 0.577), (0.000, 0.894, 0.447)) = 0.774
d'((0.894, 0.000, 0.447), (0.000, 0.894, 0.447)) = 0.200
Again "UVW" is determined to be closest to "VVW" (larger numbers indicate higher similarity).
Both versions yield different numbers, but the resulting ranking is always the same. One could also use other norms - the Manhattan distance (L1 norm), for example - but this will only change the numbers, because all norms on a finite-dimensional vector space are equivalent.
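As a rough sketch of this idea in Python (my own illustration, assuming only case-insensitive Latin letters matter; the function names are made up):

from collections import Counter
import math
import string

def histogram(s):
    # count only the 26 case-insensitive Latin letters
    counts = Counter(ch for ch in s.lower() if ch in string.ascii_lowercase)
    return [counts.get(ch, 0) for ch in string.ascii_lowercase]

def normalize(h):
    r = math.sqrt(sum(x * x for x in h))   # Euclidean norm
    return [x / r for x in h] if r else h

def cosine_similarity(h1, h2):
    return sum(a * b for a, b in zip(normalize(h1), normalize(h2)))

L = ["UUU", "UVW", "WUU"]
histograms = [histogram(w) for w in L]     # preprocessing, done once
hs = histogram("VVW")
best = max(zip(L, histograms), key=lambda wh: cosine_similarity(wh[1], hs))[0]
print(best)                                # "UVW", as in the worked example above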

Sounds like you need a trie. Tries are used to search for words, similar to the way a spell checker works. So if the string S has the characters in the same order as the strings in L, then this may work for you.
If however, the order of the characters in S is not relevant - like a set of scrabble tiles and you want to search for the longest word - then this is not your solution.

What you want is a BK-Tree. It's a bit unintuitive, but very cool - and it makes it possible to search for elements within a Levenshtein (edit) distance threshold in roughly O(log n) time.
If you care about ordering in your input strings, use them as is. If you don't, you can sort the individual characters before inserting them into the BK-Tree (or querying with them).
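A minimal BK-Tree sketch in Python (my own illustration, not from the answer; a real implementation would use a proper node class, and you would sort each word's characters before add/search if character order doesn't matter):

def levenshtein(a, b):
    # classic two-row dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None                  # node = (word, {distance: child node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, threshold):
        results, stack = [], ([self.root] if self.root else [])
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= threshold:
                results.append((d, node[0]))
            # triangle inequality: only branches with d - threshold <= key <= d + threshold can match
            for key, child in node[1].items():
                if d - threshold <= key <= d + threshold:
                    stack.append(child)
        return results

tree = BKTree()
for w in ["hello", "help", "hell", "world"]:
    tree.add(w)
print(tree.search("helo", 1))   # all of "hello", "help", "hell" are within distance 1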

I believe what you're looking for can be found here: Fuzzy Logic Based Search Technique
It's pretty heavy, but so is what you're asking for. It talks about word similarities, and character misplacement.
i.e:
L I N E A R T R N A S F O R M
L I N A E R T R A N S F O R M
L E N E A R T R A N S F R M

It seems to me that the order of the characters is not important in your problem, but you are searching for "near-anagrams" of the word S.
If that's so, then you can represent every word in the set L as an array of 26 integers (assuming your alphabet has 26 letters). You can represent S similarly as an array of 26 integers; now to find the best match you just run once through the set L and calculate a distance metric between the S-vector and the current L-vector, however you want to define the distance metric (e.g. Euclidean / sum-of-squares or Manhattan / sum of absolute differences). This is an O(n) algorithm because the vectors have constant length.

Here is a T-SQL function that has been working great for me; it gives you the edit distance:
Example:
SELECT TOP 1 [StringValue], edit_distance([StringValue], 'Input Value')
FROM [SomeTable]
ORDER BY edit_distance([StringValue], 'Input Value')
The Function:
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
    DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
            @cv0 varbinary(8000), @cv1 varbinary(8000)
    SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
    WHILE @j <= @s2_len
        SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
    WHILE @i <= @s1_len
    BEGIN
        SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
        WHILE @j <= @s2_len
        BEGIN
            SET @c = @c + 1
            SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
                CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
            IF @c > @c_temp SET @c = @c_temp
            SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
            IF @c > @c_temp SET @c = @c_temp
            SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
        END
        SELECT @cv1 = @cv0, @i = @i + 1
    END
    RETURN @c
END


Number of Ways To arrange Sequence

I have M characters. From these characters I need to make a sequence of length N such that no two consecutive characters are the same, and the first and last characters of the sequence are fixed. I need to find the total number of ways to do this.
My Approach:
Dynamic programming.
If the first and last characters are '0' and '1':
dp[1][0] = 1, dp[1][1] = 1
for (int i = 2; i < N; i++)
    for (int j = 0; j < M; j++)
        for (int k = 0; k < M; k++)
            if (j != k) dp[i][j] += dp[i-1][k];
So the final answer would be the summation of dp[N-1][i] over i != 1.
Problem:
Here the length N is very large (around 10^15) and M is around 128. How do I find the number of such sequences without using arrays?
Assume M is fixed. Let D(n) be the number of sequences of length n with no two consecutive characters equal, where the first and last characters differ (but are fixed). Let S(n) be the number of such sequences of length n where the first and last characters are the same (but are fixed).
For example, D(6) is the number of strings of the form a????b (for some a and b -- noting that for counting it doesn't matter which two characters we chose, and where the ? represent other characters). Similarly, S(6) is the number of strings of the form a????a.
Consider a sequence of length n>3 of the form a....?b. The ? can be any of m-1 characters (anything except b). One of these is a. So D(n) = S(n-1) + (m-2)D(n-1). Using a similar argument, one can figure out that S(n) = (M-1)D(n-1).
For example, how many strings are there of the form a??b? Well, the character just before the b could be a or something else. How many strings are there when it's a? Well, it's the same as the number of strings of the form a?a. How many strings are there when it's something else? Well it's the same as the number of strings of the form a?c multiplied by the number of choices we had for c (namely: m-2 -- everything except for a which we've already counted, and b which is excluded by the rules).
If n is odd, we can consider the middle character. Consider a sequence of length n of the form a...?...b. The ? (which is in the center of the string) can be a, b, or one of the other M-2 characters. Thus D(2n+1) = S(n+1)D(n+1) + D(n+1)S(n+1) + (M-2)D(n+1)D(n+1). Similarly, S(2n+1) = S(n+1)S(n+1) + (M-1)D(n+1)D(n+1).
For small n, S(2)=0, S(3)=M-1, D(2)=1, D(3)=M-2.
We can use the above equations (the first set for even n > 3, the second set for odd n > 3, and the base cases for n = 2 or 3) to compute the result you need in O(log N) arithmetic operations. Presumably the question asks you to compute the result modulo something (since the result grows like O(M^(N-2))), but that's easy to incorporate into the results.
Working code that uses this approach:
def C(n, m, p):
    if n == 2:
        return 0, 1
    if n == 3:
        return (m-1) % p, (m-2) % p
    if n % 2 == 0:
        S, D = C(n-1, m, p)
        return ((m-1) * D) % p, (S + (m-2) * D) % p
    else:
        S, D = C((n-1)//2 + 1, m, p)
        return (S*S + (m-1)*D*D) % p, (2*S*D + (m-2)*D*D) % p
Note that in this code, C(n, m, p) returns two numbers -- S(n)%p and D(n)%p.
For example:
>>> p = 2**64 - 59  # Some large prime
>>> print(C(4, 128, p))
(16002, 16003)
>>> print(C(5, 128, p))
(2032381, 2032380)
>>> print(C(10**15, 128, p))
(12557489471374801501, 12557489471374801502)
Looking at these examples, it seems like D(n) = S(n) + (-1)^n. If that's true, the code can be simplified a bit I guess.
Another, perhaps easier, way to do it efficiently is to use a matrix and the first set of equations. (Sorry for the ascii art -- this diagram is a vector = matrix * vector):
( D(n) )   ( M-2  1 ) ( D(n-1) )
( S(n) ) = ( M-1  0 ) ( S(n-1) )
Telescoping this, and using that D(2)=1, S(2)=0:
( D(n) )   ( M-2  1 )^(n-2) ( 1 )
( S(n) ) = ( M-1  0 )        ( 0 )
You can perform the matrix power using exponentiation by squaring in O(log n) time.
Here's working code, including the examples (which you can check produce the same values as the code above). Most of the code is actually matrix multiply and matrix power -- you can probably replace a lot of it with numpy code if you use that package.
def mat_mul(M, N, p):
    R = [[0, 0], [0, 0]]
    for i in range(2):
        for j in range(2):
            for k in range(2):
                R[i][j] += M[i][k] * N[k][j]
            R[i][j] %= p
    return R

def mat_pow(M, n, p):
    if n == 0:
        return [[1, 0], [0, 1]]
    if n == 1:
        return M
    if n % 2 == 0:
        R = mat_pow(M, n//2, p)
        return mat_mul(R, R, p)
    return mat_mul(M, mat_pow(M, n-1, p), p)

def Cmat(n, m, p):
    M = [[m-2, 1], [m-1, 0]]
    M = mat_pow(M, n-2, p)
    return M[1][0], M[0][0]

p = 2**64 - 59
print(Cmat(4, 128, p))
print(Cmat(5, 128, p))
print(Cmat(10**15, 128, p))
You only need to count the number of acceptable sequences, not find them explicitly. It turns out that it doesn't matter what the majority of the characters are. There are only 4 kinds of characters that matter:
The first character
The last character
The last-used character, so you don't repeat characters consecutively
All other characters
In other words, you don't need to iterate over all 10^15 characters. You only need to consider the four cases above, since most characters can be lumped together into the last case.

Number equal to the sum of powers of its digits

I've got another interesting programming/mathematical problem.
For a given natural number q from the interval [2; 10000], find the number n
which is equal to the sum of the q-th powers of its digits modulo 2^64.
For example: for q = 3, n = 153; for q = 5, n = 4150.
I wasn't sure whether this problem fits better on math.se or stackoverflow, but it was a programming task which my friend told me about quite a long time ago. Now I remembered it and would like to know how such things can be done. How should I approach this?
There are two key points:
the range of possible solutions is bounded,
any group of numbers whose digits are the same up to permutation can contain at most one solution.
Let us take a closer look at the case q = 2. If a d-digit number n is equal to the sum of the squares of its digits, then
n >= 10^(d-1) // because it's a d-digit number
n <= d*9^2 // because each digit is at most 9
and the condition 10^(d-1) <= d*81 easily translates to d <= 3, i.e. n < 1000. That's not many numbers to check, and a brute force over those is fast. For q = 3, the condition 10^(d-1) <= d*729 yields d <= 4, still not many numbers to check.
We could find smaller bounds by analysing further. For q = 2, the sum of the squares of at most three digits is at most 243, so a solution must be less than 244. The maximal sum of squares of digits in that range is reached for 199: 1² + 9² + 9² = 163; continuing, one can easily find that a solution must be less than 100. (The only solution for q = 2 is 1.) For q = 3, the maximal sum of four cubes of digits is 4*729 = 2916; continuing, we can see that all solutions for q = 3 are less than 1000. But that sort of improvement of the bound is only useful for small exponents, and it breaks down once the sum of the powers of the digits can exceed the modulus. Therefore I stop at finding the maximal possible number of digits.
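(As a quick illustration, not part of the original answer: a brute force over these small bounds in Python already reproduces the q = 2 and q = 3 solutions listed further below.)

def digit_power_sum_fixed_points(q, limit):
    # check every candidate below the bound derived above
    return [n for n in range(1, limit)
            if n == sum(int(d) ** q for d in str(n))]

print(digit_power_sum_fixed_points(2, 1000))    # [1]
print(digit_power_sum_fixed_points(3, 10000))   # [1, 153, 370, 371, 407]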
Now, without the modulus, the bound on the number of digits for the sum of the q-th powers of the digits would be approximately
q - (q/20) + 1
so for larger q, the range of possible solutions obtained from that is huge.
But two points come to the rescue here: first the modulus, which limits the solution space to 2 <= n < 2^64, i.e. at most 20 digits; and second the permutation-invariance of the (modular) digital power sum.
The permutation invariance means that we only need to construct monotonic (say, non-decreasing) sequences of d digits, calculate the sum of the q-th powers, and check whether the number thus obtained has the correct digits.
Since the number of monotonic d-digit sequences is comparably small, a brute force using them becomes feasible - in particular if we ignore digits not contributing to the sum (0 for all exponents, 8 for q >= 22, also 4 for q >= 32, all even digits for q >= 64).
The number of monotonic sequences of length d using s symbols is
binom(s+d-1, d)
s is for us at most 9, and d <= 20; summing from d = 1 to d = 20, there are at most 10015004 sequences to consider for each exponent. That's not too much.
Still, doing that for all q under consideration amounts to a long time, but if we take into account that for q >= 64 we have x^q % 2^64 == 0 for all even digits x, we need only consider sequences composed of odd digits, and the total number of monotonic sequences of length at most 20 using 5 symbols is binom(20+5, 20) - 1 = 53129. Now, that looks good.
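(Those two counts are easy to double-check, for example with Python's math.comb; the sums telescope to a single binomial coefficient:)

import math

# non-decreasing digit sequences of length 1..20 over s symbols: sum over d of binom(s+d-1, d)
print(sum(math.comb(9 + d - 1, d) for d in range(1, 21)))   # 10015004 (s = 9)
print(sum(math.comb(5 + d - 1, d) for d in range(1, 21)))   # 53129    (s = 5, odd digits only)
print(math.comb(25, 20) - 1)                                # 53129, the closed form binom(20+5, 20) - 1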
Summary
We consider a function f mapping digits to natural numbers and are looking for solutions of the equation
n == (sum [f(d) | d <- digits(n)] `mod` 2^64)
where digits maps n to the list of its digits.
From f, we build a function F from lists of digits to natural numbers,
F(list) = sum [f(d) | d <- list] `mod` 2^64
Then we are looking for fixed points of G = F ∘ digits. Now n is a fixed point of G if and only if digits(n) is a fixed point of H = digits ∘ F. Hence we may equivalently look for fixed points of H.
But F is permutation-invariant, so we can restrict ourselves to sorted lists and consider K = sort ∘ digits ∘ F.
Fixed points of H and of K are in one-to-one correspondence. If list is a fixed point of H, then sort(list) is a fixed point of K, and if sortedList is a fixed point of K, then H(sortedList) is a permutation of sortedList, hence H(H(sortedList)) = H(sortedList); in other words, H(sortedList) is a fixed point of H, and sort resp. H are bijections between the sets of fixed points of H and K.
A further improvement is possible if some f(d) are 0 (modulo 2^64). Let compress be a function that removes digits with f(d) mod 2^64 == 0 from a list of digits and consider the function L = compress ∘ K.
Since F ∘ compress = F, if list is a fixed point of K, then compress(list) is a fixed point of L. Conversely, if clist is a fixed point of L, then K(clist) is a fixed point of K, and compress resp. K are bijections between the sets of fixed points of L resp. K. (And H(clist) is a fixed point of H, and compress ∘ sort resp. H are bijections between the sets of fixed points of L resp. H.)
The space of compressed sorted lists of at most d digits is small enough to brute-force for the functions f under consideration, namely power functions.
So the strategy is:
Find the maximal number d of digits to consider (bounded by 20 due to the modulus, smaller for small q).
Generate the compressed monotonic sequences of up to d digits.
Check whether the sequence is a fixed point of L, if it is, F(sequence) is a fixed point of G, i.e. a solution of the problem.
Code
Fortunately, you haven't specified a language, so I went for the option of simplest code, i.e. Haskell:
{-# LANGUAGE CPP #-}
module Main (main) where

import Data.List
import Data.Array.Unboxed
import Data.Word
import Text.Printf

#include "MachDeps.h"
#if WORD_SIZE_IN_BITS == 64
type UINT64 = Word
#else
type UINT64 = Word64
#endif

maxDigits :: UINT64 -> Int
maxDigits mx = min 20 $ go d0 (10^(d0-1)) start
  where
    d0 = floor (log (fromIntegral mx) / log 10) + 1
    mxi :: Integer
    mxi = fromIntegral mx
    start = mxi * fromIntegral d0
    go d p10 mmx
        | p10 > mmx = d-1
        | otherwise = go (d+1) (p10*10) (mmx+mxi)

sortedDigits :: UINT64 -> [UINT64]
sortedDigits = sort . digs
  where
    digs 0 = []
    digs n = case n `quotRem` 10 of
               (q,r) -> r : digs q

generateSequences :: Int -> [a] -> [[a]]
generateSequences 0 _
    = [[]]
generateSequences d [x]
    = [replicate d x]
generateSequences d (x:xs)
    = [replicate k x ++ tl | k <- [d,d-1 .. 0], tl <- generateSequences (d-k) xs]
generateSequences _ _ = []

fixedPoints :: (UINT64 -> UINT64) -> [UINT64]
fixedPoints digFun = sort . map listNum . filter okSeq $
                     [ds | d <- [1 .. mxdigs], ds <- generateSequences d contDigs]
  where
    funArr :: UArray UINT64 UINT64
    funArr = array (0,9) [(i,digFun i) | i <- [0 .. 9]]
    mxval = maximum (elems funArr)
    contDigs = filter ((/= 0) . (funArr !)) [0 .. 9]
    mxdigs = maxDigits mxval
    listNum = sum . map (funArr !)
    numFun = listNum . sortedDigits
    listFun = inter . sortedDigits . listNum
    inter = go contDigs
      where
        go cds@(c:cs) dds@(d:ds)
            | c < d     = go cs dds
            | c == d    = c : go cds ds
            | otherwise = go cds ds
        go _ _ = []
    okSeq ds = ds == listFun ds

solve :: Int -> IO ()
solve q = do
    printf "%d:\n " q
    print (fixedPoints (^q))

main :: IO ()
main = mapM_ solve [2 .. 10000]
It's not optimised, but as is, it finds all solutions for 2 <= q <= 10000 in a little below 50 minutes on my box, starting with
2:
[1]
3:
[1,153,370,371,407]
4:
[1,1634,8208,9474]
5:
[1,4150,4151,54748,92727,93084,194979]
6:
[1,548834]
7:
[1,1741725,4210818,9800817,9926315,14459929]
8:
[1,24678050,24678051,88593477]
9:
[1,146511208,472335975,534494836,912985153]
10:
[1,4679307774]
11:
[1,32164049650,32164049651,40028394225,42678290603,44708635679,49388550606,82693916578,94204591914]
And ending with
9990:
[1,12937422361297403387,15382453639294074274]
9991:
[1,16950879977792502812]
9992:
[1,2034101383512968938]
9993:
[1]
9994:
[1,9204092726570951194,10131851145684339988]
9995:
[1]
9996:
[1,10606560191089577674,17895866689572679819]
9997:
[1,8809232686506786849]
9998:
[1]
9999:
[1]
10000:
[1,11792005616768216715]
The exponents from about 10 to 63 take longest (individually, not cumulatively); there's a remarkable speedup from exponent 64 on due to the reduced search space.
Here is a brute force solution that will solve for all such n, including 1 and any other n greater than the first, within whatever range you choose (in this case I chose base^q as my range limit). You could modify it to ignore the special case of 1 and also to return after the first result. It's in C#, but might look nicer in a language with a ** exponentiation operator. You could also pass in your q and base as parameters.
int q = 5;
int radix = 10;

for (int input = 1; input < (int)Math.Pow(radix, q); input++)
{
    int sum = 0;
    for (int i = 1; i < (int)Math.Pow(radix, q); i *= radix)
    {
        int x = input / i % radix;      // get current digit
        sum += (int)Math.Pow(x, q);     // x**q
    }
    if (sum == input)
    {
        Console.WriteLine("Hooray: {0}", input);
    }
}
So, for q = 5 the results are:
Hooray: 1
Hooray: 4150
Hooray: 4151
Hooray: 54748
Hooray: 92727
Hooray: 93084

Pseudo number generation

The following is text from Data Structures and Algorithm Analysis by Mark Allen Weiss.
Following x(i+1) should be read as x subscript of i+1, and x(i) should be
read as x subscript i.
x(i + 1) = (a*x(i))mod m.
It is also common to return a random real number in the open interval
(0, 1) (0 and 1 are not possible values); this can be done by
dividing by m. From this, a random number in any closed interval [a,
b] can be computed by normalizing.
The problem with this routine is that the multiplication could
overflow; although this is not an error, it affects the result and
thus the pseudo-randomness. Schrage gave a procedure in which all of
the calculations can be done on a 32-bit machine without overflow. We
compute the quotient and remainder of m/a and define these as q and
r, respectively.
In our case for M=2,147,483,647 A =48,271, q = 127,773, r = 2,836, and r < q.
We have
x(i+1) = (a*x(i)) mod m                        --> Eq 1
       = a*x(i) - m * floor(a*x(i)/m)          --> Eq 2
The author also mentions:
x(i) = q * (floor of (x(i)/q)) + (x(i) mod q)  --> Eq 3
My question
What does the author mean by "a random number is computed by normalizing"?
How did the author get Eq 2 from Eq 1?
How did the author get Eq 3?
Normalizing means if you have X ∈ [0,1] and you need to get Y ∈ [a, b] you can compute
Y = a + X * (b - a)
EDIT:
2. Let's suppose
a = 3, x = 5, m = 9
Then we have a*x = 15 and 15 = [ax/m]*m + (ax mod m),
where [ax/m] means the integer part of ax/m.
So we have 15 = [ax/m]*m + 6.
We need to get the 6: 15 - [ax/m]*m = 6 => ax - [ax/m]*m = 6 => x(i+1) = a*x(i) - [a*x(i)/m]*m
If you have a random number in the range [0,1], you can get a number in the range [2,5] (for example) by multiplying by 3 and adding 2.
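As an aside (my own sketch, not from the original answers): Schrage's trick computes a*x mod m as a*(x mod q) - r*(x div q), adding m if the result is negative, where q = m div a and r = m mod a. Note that the values q = 127,773 and r = 2,836 quoted above actually correspond to the older multiplier a = 16,807 (for a = 48,271 one gets q = 44,488 and r = 3,399). A minimal Python check that the trick agrees with the direct computation:

M = 2147483647            # 2^31 - 1
A = 16807                 # the multiplier that matches q = 127,773 and r = 2,836
Q = M // A                # 127773
R = M % A                 # 2836

def next_direct(x):
    # straightforward, but the product A*x needs more than 32 bits
    return (A * x) % M

def next_schrage(x):
    # Schrage: a*x mod m == a*(x mod q) - r*(x div q)  (+ m if negative);
    # because r < q, every intermediate value fits in a signed 32-bit integer
    t = A * (x % Q) - R * (x // Q)
    return t if t >= 0 else t + M

x = 1
for _ in range(1000):
    assert next_direct(x) == next_schrage(x)
    x = next_schrage(x)
print(x)   # both methods stay in lockstep for the whole run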

Randomly Generate a set of numbers of n length totaling x

I'm working on a project for fun and I need an algorithm to do as follows:
Generate a list of numbers of Length n which add up to x
I would settle for list of integers, but ideally, I would like to be left with a set of floating point numbers.
I would be very surprised if this problem wasn't heavily studied, but I'm not sure what to look for.
I've tackled similar problems in the past, but this one is decidedly different in nature. Previously I've generated different combinations of a list of numbers that will add up to x. I'm sure that I could simply brute-force this problem, but that hardly seems like the ideal solution.
Anyone have any idea what this may be called, or how to approach it? Thanks all!
Edit: To clarify, I mean that the list should be length N while the numbers themselves can be of any size.
edit2: Sorry for my improper use of 'set', I was using it as a catch all term for a list or an array. I understand that it was causing confusion, my apologies.
This is how to do it in Python
import random

def random_values_with_prescribed_sum(n, total):
    x = [random.random() for i in range(n)]
    k = total / sum(x)
    return [v * k for v in x]
Basically you pick n random numbers, compute their sum and compute a scale factor so that the sum will be what you want it to be.
Note that this approach will not produce "uniform" slices, i.e. the distribution you will get will tend to be more "egalitarian" than it would be if it were picked at random among all distributions with the given sum.
To see the reason you can just picture what the algorithm does in the case of two numbers with a prescribed sum (e.g. 1):
The point P is a generic point obtained by picking two random numbers, and it will be uniform inside the square [0,1]x[0,1]. The point Q is the point obtained by scaling P so that the sum is 1. As is clear from the picture, the points close to the center of the segment have a higher probability; for example the exact center, (0.5, 0.5), will be found by projecting any point on the diagonal (0,0)-(1,1), while the point (0, 1) will be found by projecting only points from (0,0)-(0,1)... the diagonal's length is sqrt(2) = 1.4142... while the square's side is only 1.0.
Actually, you need to generate a partition of x into n parts. This is usually done in the following way: the partition of x into n non-negative parts can be represented as follows: reserve n + x free places, put n borders in some arbitrary places, and stones in the rest. The stone groups add up to x, thus the number of possible partitions is the binomial coefficient C(n + x, n).
So your algorithm could be as follows: choose an arbitrary n-subset of an (n + x)-set; it determines uniquely a partition of x into n parts (a sketch of this border-placing idea appears below, after Algorithm S).
In Knuth's TAOCP the chapter 3.4.2 discusses random sampling. See Algortihm S there.
Algorithm S: (choose n arbitrary records from total of N)
1. t = 0, m = 0
2. u = random, uniformly distributed on (0, 1)
3. if (N - t)*u >= n - m, skip the t-th record and increase t by 1; otherwise include the t-th record in the sample, and increase m and t by 1
4. if m < n, return to step 2; otherwise the algorithm is finished
The solution for non-integers is algorithmically trivial: you just select n arbitrary numbers that don't sum to 0, normalize them by their sum, and scale by x.
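A sketch of the border-placing idea in Python (my own illustration, using the standard n - 1 borders among x + n - 1 slots variant; random.sample plays the role of Algorithm S for picking the border positions):

import random

def random_composition(x, n):
    # place n-1 "borders" among x+n-1 slots; the runs of non-border slots
    # ("stones") between them are the n non-negative parts
    borders = sorted(random.sample(range(x + n - 1), n - 1))
    parts, prev = [], -1
    for b in borders:
        parts.append(b - prev - 1)
        prev = b
    parts.append(x + n - 2 - prev)
    return parts

p = random_composition(15, 4)
print(p, sum(p))   # four non-negative integers that always sum to 15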
If you want to sample uniformly in the region of N-1-dimensional space defined by x1 + x2 + ... + xN = x, then you're looking at a special case of sampling from a Dirichlet distribution. The sampling procedure is a little more involved than generating uniform deviates for the xi. Here's one way to do it, in Python:
import random
xs = [random.gammavariate(1, 1) for a in range(N)]   # N as in the question
xs = [x * v / sum(xs) for v in xs]                   # x is the desired total
If you don't care too much about the sampling properties of your results, you can just generate uniform deviates and correct their sum afterwards.
Here is a version of the above algorithm in JavaScript:
function getRandomArbitrary(min, max) {
    return Math.random() * (max - min) + min;
}

function getRandomArray(min, max, n) {
    var arr = [];
    for (var i = 0, l = n; i < l; i++) {
        arr.push(getRandomArbitrary(min, max));
    }
    return arr;
}

function randomValuesPrescribedSum(min, max, n, total) {
    var arr = getRandomArray(min, max, n);
    var sum = arr.reduce(function(pv, cv) { return pv + cv; }, 0);
    var k = total / sum;
    var delays = arr.map(function(x) { return k * x; });
    return delays;
}
You can call it with
var myarray = randomValuesPrescribedSum(0,1,3,3);
And then check it with
var sum = myarray.reduce(function(pv, cv) { return pv + cv;},0);
This code does a reasonable job. I think it produces a different distribution than 6502's answer, but I am not sure which is better or more natural. Certainly his code is clearer/nicer.
import random

def parts(total_sum, num_parts):
    points = [random.random() for i in range(num_parts - 1)]
    points.append(0)
    points.append(1)
    points.sort()
    ret = []
    for i in range(1, len(points)):
        ret.append((points[i] - points[i-1]) * total_sum)
    return ret

def test(total_sum, num_parts):
    ans = parts(total_sum, num_parts)
    assert abs(sum(ans) - total_sum) < 1e-7
    print(ans)

test(5.5, 3)
test(10, 1)
test(10, 5)
In Python:
a: create a list of (random #'s 0 to 1) times total; append 0 and total to the list
b: sort the list, measure the distance between each element
c: round the list elements
import random
import time

TOTAL = 15
PARTS = 4
PLACES = 3

def random_sum_split(parts, total, places):
    a = [0, total] + [random.random()*total for i in range(parts-1)]
    a.sort()
    b = [(a[i] - a[i-1]) for i in range(1, (parts+1))]
    if places == None:
        return b
    else:
        b.pop()
        c = [round(x, places) for x in b]
        c.append(round(total - sum(c), places))
        return c

def tick():
    if info.tick == 1:
        start = time.time()
        alpha = random_sum_split(PARTS, TOTAL, PLACES)
        end = time.time()
        log('alpha: %s' % alpha)
        log('total: %.7f' % sum(alpha))
        log('parts: %s' % PARTS)
        log('places: %s' % PLACES)
        log('elapsed: %.7f' % (end - start))
yields:
[2014-06-13 01:00:00] alpha: [0.154, 3.617, 6.075, 5.154]
[2014-06-13 01:00:00] total: 15.0000000
[2014-06-13 01:00:00] parts: 4
[2014-06-13 01:00:00] places: 3
[2014-06-13 01:00:00] elapsed: 0.0005839
To the best of my knowledge this distribution is uniform.

Efficient algorithm to randomly select items with frequency

Given an array of n word-frequency pairs:
[ (w_0, f_0), (w_1, f_1), ..., (w_{n-1}, f_{n-1}) ]
where w_i is a word, f_i is an integer frequency, and the sum of the frequencies ∑f_i = m,
I want to use a pseudo-random number generator (pRNG) to select p words w_{j_0}, w_{j_1}, ..., w_{j_{p-1}} such that
the probability of selecting any word is proportional to its frequency:
P(w_i = w_{j_k}) = P(i = j_k) = f_i / m
(Note, this is selection with replacement, so the same word could be chosen every time).
I've come up with three algorithms so far:
1. Create an array of size m, and populate it so the first f_0 entries are w_0, the next f_1 entries are w_1, and so on, so the last f_{n-1} entries are w_{n-1}: [ w_0, ..., w_0, w_1, ..., w_1, ..., w_{n-1}, ..., w_{n-1} ]
Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
This takes O(n + m + p) work, which isn't great, since m can be much much larger than n.
2. Step through the input array once, computing m_i = ∑_{h≤i} f_h = m_{i-1} + f_i, and after computing m_i, use the pRNG to generate a number x_k in the range 0...m_i-1 for each k in 0...p-1, and select w_i for w_{j_k} (possibly replacing the current value of w_{j_k}) if x_k < f_i.
This requires O(n + np) work.
3. Compute m_i as in algorithm 2, and generate the following array of n word-frequency-partial-sum triples: [ (w_0, f_0, m_0), (w_1, f_1, m_1), ..., (w_{n-1}, f_{n-1}, m_{n-1}) ]
Then, for each k in 0...p-1, use the pRNG to generate a number x_k in the range 0...m-1, then do binary search on the array of triples to find the i s.t. m_i - f_i ≤ x_k < m_i, and select w_i for w_{j_k}.
This requires O(n + p log n) work.
My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.
Look at Roulette Selection in Genetic Algorithms
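A rough sketch of roulette-wheel selection in Python (my own illustration; it is essentially the questioner's algorithm 3, prefix sums plus binary search, using the word list from the answer further down):

import bisect
import random
from itertools import accumulate

def sample_with_replacement(pairs, p):
    # pairs: list of (word, frequency); O(n) preprocessing, O(log n) per draw
    words = [w for w, f in pairs]
    cumulative = list(accumulate(f for w, f in pairs))   # m_0, m_1, ..., m_{n-1}
    m = cumulative[-1]
    picks = []
    for _ in range(p):
        x = random.randrange(m)                          # x_k in 0 ... m-1
        i = bisect.bisect_right(cumulative, x)           # smallest i with x < m_i
        picks.append(words[i])
    return picks

print(sample_with_replacement([("Hero", 80), ("Monkey", 4), ("Shoe", 13), ("Highway", 3)], 7))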
You could create the target array, then loop through the words determining the probability that it should be picked, and replace the words in the array according to a random number.
For the first word the probability would be f_0/m_0 (where m_n = f_0 + .. + f_n), i.e. 100%, so all positions in the target array would be filled with w_0.
For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words according to the frequency.
Example code in C#:
public class WordFrequency {
    public string Word { get; private set; }
    public int Frequency { get; private set; }
    public WordFrequency(string word, int frequency) {
        Word = word;
        Frequency = frequency;
    }
}

WordFrequency[] words = new WordFrequency[] {
    new WordFrequency("Hero", 80),
    new WordFrequency("Monkey", 4),
    new WordFrequency("Shoe", 13),
    new WordFrequency("Highway", 3),
};

int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();

foreach (WordFrequency wf in words) {
    sum += wf.Frequency;
    for (int i = 0; i < p; i++) {
        if (rnd.Next(sum) < wf.Frequency) {
            result[i] = wf.Word;
        }
    }
}
Ok, I found another algorithm: the alias method (also mentioned in this answer). Basically it creates a partition of the probability space such that:
There are n partitions, all of the same width r s.t. nr = m.
each partition contains two words in some ratio (which is stored with the partition).
for each word w_i: f_i = ∑_{partitions t s.t. w_i ∈ t} r × ratio(t, w_i)
Since all the partitions are of the same size, selecting which partition can be done in constant work (pick an index from 0...n-1 at random), and the partition's ratio can then be used to select which word is used in constant work (compare a pRNGed number with the ratio between the two words). So this means the p selections can be done in O(p) work, given such a partition.
The reason that such a partitioning exists is that there exists a word w_i s.t. f_i < r if and only if there exists a word w_i' s.t. f_i' > r, since r is the average of the frequencies.
Given such a pair w_i and w_i', we can replace them with a pseudo-word w'_i of frequency f'_i = r (that represents w_i with probability f_i/r and w_i' with probability 1 - f_i/r) and a new word w'_i' of adjusted frequency f'_i' = f_i' - (r - f_i) respectively. The average frequency of all the words will still be r, and the rule from the prior paragraph still applies. Since the pseudo-word has frequency r and is made of two words with frequency ≠ r, we know that if we iterate this process, we will never make a pseudo-word out of a pseudo-word, and such iteration must end with a sequence of n pseudo-words, which are the desired partition.
To construct this partition in O(n) time,
go through the list of the words once, constructing two lists:
one of words with frequency ≤ r
one of words with frequency > r
then pull a word from the first list
if its frequency = r, then make it into a one element partition
otherwise, pull a word from the other list, and use it to fill out a two-word partition. Then put the second word back into either the first or second list according to its adjusted frequency.
This actually still works if the number of partitions q > n (you just have to prove it differently). If you want to make sure that r is integral, and you can't easily find a factor q of m s.t. q > n, you can pad all the frequencies by a factor of n, so f'_i = n·f_i, which updates m' = m·n and sets r' = m when q = n.
In any case, this algorithm only takes O(n + p) work, which I have to think is optimal.
In Ruby:
def weighted_sample_with_replacement(input, p)
  n = input.size
  m = input.inject(0) { |sum,(word,freq)| sum + freq }
  # find the words with frequency lesser and greater than average
  lessers, greaters = input.map do |word,freq|
    # pad the frequency so we can keep it integral
    # when subdivided
    [ word, freq*n ]
  end.partition do |word,adj_freq|
    adj_freq <= m
  end
  partitions = Array.new(n) do
    word, adj_freq = lessers.shift
    other_word = if adj_freq < m
      # use part of another word's frequency to pad
      # out the partition
      other_word, other_adj_freq = greaters.shift
      other_adj_freq -= (m - adj_freq)
      (other_adj_freq <= m ? lessers : greaters) << [ other_word, other_adj_freq ]
      other_word
    end
    [ word, other_word, adj_freq ]
  end
  (0...p).map do
    # pick a partition at random
    word, other_word, adj_freq = partitions[ rand(n) ]
    # select the first word in the partition with appropriate
    # probability
    if rand(m) < adj_freq
      word
    else
      other_word
    end
  end
end
