Number of Ways To arrange Sequence - algorithm

I have M characters. From these characters I need to make a sequence of length N such that no two consecutive characters are the same, and the first and last characters of the sequence are fixed. I need to find the total number of such sequences.
My Approach:
Dynamic programming.
If the first and last characters are '0' and '1':
dp[1][0]=1 , dp[1][1]=1
for(int i=2;i<N;i++)
    for(int j=0;j<M;j++)
        for(int k=0;k<M;k++)
            if(j!=k) dp[i][j]+=dp[i-1][k]
So the final answer would be the sum of dp[N-1][i] over all i != 1.
Problem:
Here the length N is very large (around 10^15) and M is around 128. How can I find the number of sequences without using arrays?

Assume M is fixed. Let D(n) be the number of sequences of length n with no two equal consecutive characters where the first and last characters differ (but are fixed). Let S(n) be the number of such sequences of length n where the first and last characters are the same (but are fixed).
For example, D(6) is the number of strings of the form a????b (for some a and b -- noting that for counting it doesn't matter which two characters we chose, and where the ? represent other characters). Similarly, S(6) is the number of strings of the form a????a.
Consider a sequence of length n>3 of the form a....?b. The ? can be any of M-1 characters (anything except b). One of these is a. So D(n) = S(n-1) + (M-2)D(n-1). Using a similar argument, one can figure out that S(n) = (M-1)D(n-1).
For example, how many strings are there of the form a??b? Well, the character just before the b could be a or something else. How many strings are there when it's a? Well, it's the same as the number of strings of the form a?a. How many strings are there when it's something else? Well it's the same as the number of strings of the form a?c multiplied by the number of choices we had for c (namely: M-2 -- everything except for a which we've already counted, and b which is excluded by the rules).
If the length is odd, we can consider the middle character. Consider a sequence of length 2n+1 of the form a...?...b. The ? (which is in the center of the string) can be a, b, or one of the other M-2 characters. Thus D(2n+1) = S(n+1)D(n+1) + D(n+1)S(n+1) + (M-2)D(n+1)D(n+1). Similarly, S(2n+1) = S(n+1)S(n+1) + (M-1)D(n+1)D(n+1).
For small n, S(2)=0, S(3)=M-1, D(2)=1, D(3)=M-2.
We can use the above equations (the first set for even n>3, the second set for odd n>3, and the base cases for n=2 or 3) to compute the result you need in O(log N) arithmetic operations. Presumably the question asks you to compute the result modulo something (since the result grows like O(M^(N-2))), but that's easy to incorporate into the results.
Working code that uses this approach:
def C(n, m, p):
    if n == 2:
        return 0, 1
    if n == 3:
        return (m-1)%p, (m-2)%p
    if n % 2 == 0:
        S, D = C(n-1, m, p)
        return ((m-1) * D)%p, (S + (m-2) * D)%p
    else:
        S, D = C((n-1)//2+1, m, p)
        return (S*S + (m-1)*D*D)%p, (2*S*D + (m-2)*D*D)%p
Note that in this code, C(n, m, p) returns two numbers -- S(n)%p and D(n)%p.
For example:
>>> p = 2**64 - 59 # Some large prime
>>> print(C(4, 128, p))
(16002, 16003)
>>> print(C(5, 128, p))
(2032381, 2032380)
>>> print(C(10**15, 128, p))
(12557489471374801501, 12557489471374801502)
Looking at these examples, it seems like D(n) = S(n) + (-1)^n. If that's true, the code can be simplified a bit I guess.
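That observation can be sanity-checked by brute force for small n and M. This is only a sketch and not part of the approach above: the helper name brute_force_counts is made up for illustration, and it fixes the first character to 0 and the last to 0 (for S) or 1 (for D), which assumes M >= 2.

```python
from itertools import product

def brute_force_counts(n, m):
    """Return (S(n), D(n)) for an alphabet of m >= 2 symbols by plain enumeration."""
    same = diff = 0
    for middle in product(range(m), repeat=n - 2):
        # last == 0 counts towards S(n), last == 1 counts towards D(n)
        for last in (0, 1):
            seq = (0,) + middle + (last,)
            # keep only sequences without two equal neighbours
            if all(x != y for x, y in zip(seq, seq[1:])):
                if last == 0:
                    same += 1
                else:
                    diff += 1
    return same, diff

# Compare D(n) - S(n) with (-1)^n for a few small cases.
for m in range(2, 5):
    for n in range(2, 7):
        s, d = brute_force_counts(n, m)
        print(m, n, s, d, d - s == (-1) ** n)
```

For these small cases D(n) - S(n) equals (-1)^n. It also follows from subtracting the two recurrences: D(n) - S(n) = S(n-1) - D(n-1), with D(2) - S(2) = 1.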
Another, perhaps easier, way to do it efficiently is to use a matrix and the first set of equations. (Sorry for the ascii art -- this diagram is a vector = matrix * vector):
(D(n))   (M-2  1)   (D(n-1))
(S(n)) = (M-1  0) * (S(n-1))
Telescoping this, and using that D(2)=1, S(2)=0:
(D(n))   (M-2  1)^(n-2)   (1)
(S(n)) = (M-1  0)       * (0)
You can perform the matrix power using exponentiation by squaring in O(log n) time.
Here's working code, including the examples (which you can check produce the same values as the code above). Most of the code is actually matrix multiply and matrix power -- you can probably replace a lot of it with numpy code if you use that package.
def mat_mul(M, N, p):
    R = [[0, 0], [0, 0]]
    for i in range(2):
        for j in range(2):
            for k in range(2):
                R[i][j] += M[i][k] * N[k][j]
            R[i][j] %= p
    return R

def mat_pow(M, n, p):
    if n == 0:
        return [[1, 0], [0, 1]]
    if n == 1:
        return M
    if n % 2 == 0:
        R = mat_pow(M, n//2, p)
        return mat_mul(R, R, p)
    return mat_mul(M, mat_pow(M, n-1, p), p)

def Cmat(n, m, p):
    M = [((m-2), 1), (m-1, 0)]
    M = mat_pow(M, n-2, p)
    return M[1][0], M[0][0]

p = 2**64 - 59
print(Cmat(4, 128, p))
print(Cmat(5, 128, p))
print(Cmat(10**15, 128, p))

You only need to count the number of acceptable sequences, not find them explicitly. It turns out that it doesn't matter what the majority of the characters are. There are only 4 kinds of characters that matter:
The first character
The last character
The last-used character, so you don't repeat characters consecutively
All other characters
In other words, you don't need to iterate over all 10^15 characters. You only need to consider the four cases above, since most characters can be lumped together into the last case.

Related

Remove the inferior digits of a number

Given a number n of x digits, how can y digits be removed so that the remaining digits form the greatest possible number?
Examples:
1)x=7 y=3
n=7816295
-8-6-95
=8695
2)x=4 y=2
n=4213
4--3
=43
3)x=3 y=1
n=888
=88
Just to state: x > y > 0.
For each digit to remove: iterate through the digits left to right; if you find a digit that's less than the one to its right, remove it and stop, otherwise remove the last digit.
If the number of digits x is greater than the actual length of the number, it means there are leading zeros. Since those will be the first to go, you can simply reduce the count y by a corresponding amount.
Here's a working version in Python:
def remove_digits(n, x, y):
    s = str(n)
    if len(s) > x:
        raise ValueError
    elif len(s) < x:
        y -= x - len(s)
    if y <= 0:
        return n
    for r in range(y):
        for i in range(len(s)):
            if s[i] < s[i+1:i+2]:
                break
        s = s[:i] + s[i+1:]
    return int(s)
>>> remove_digits(7816295, 7, 3)
8695
>>> remove_digits(4213, 4, 2)
43
>>> remove_digits(888, 3, 1)
88
I hesitated to submit this, because it seems too simple. But I wasn't able to think of a case where it wouldn't work.
If x = y, we have to remove all the digits.
Otherwise, find the maximum digit among the first y + 1 digits (the leftmost one, if there are ties) and remove the y0 digits before it. Append that maximum to the answer, then repeat the task on the digits after it, now with y - y0 digits left to remove.
A straightforward implementation runs in O(x^2) time in the worst case.
But finding the maximum in a given range can be done efficiently with a segment tree, which gives O(x * log(x)) time in the worst case.
P.S. I just realized that it is also possible to solve this in O(x), using the fact that there are only 10 distinct digits (though the algorithm may be a little more complicated). We need to find the maximum in a given range [L, R], but the ranges in this task only move from left to right (L and R always increase). So we just need to store 10 pointers (one per digit value), each to the first position >= L holding that digit. Then, to find the maximum, we only need to check the 10 pointers. To update the pointers, we move them to the right.
So the time complexity will be O(10 * x) = O(x).
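The straightforward O(x^2) version of the window-maximum idea can be sketched as follows. This is a minimal illustration, assuming the number is passed as a string of digits; the function name remove_digits_greedy is made up here.

```python
def remove_digits_greedy(digits: str, y: int) -> str:
    """Remove y digits from `digits` so the remaining digits form the largest number."""
    result = []
    start = 0
    # While there are removals left and more digits remain than must be removed,
    # the next kept digit is the leftmost maximum among the next y + 1 digits.
    while y > 0 and len(digits) - start > y:
        window = digits[start:start + y + 1]
        best = max(window)
        offset = window.index(best)  # index() returns the leftmost maximum
        result.append(best)
        y -= offset                  # the digits before the maximum are removed
        start += offset + 1
    if y == 0:
        result.append(digits[start:])  # nothing left to remove: keep the rest
    # otherwise exactly y digits remain and all of them are removed
    return ''.join(result)

print(remove_digits_greedy("7816295", 3))  # 8695
print(remove_digits_greedy("4213", 2))     # 43
print(remove_digits_greedy("888", 1))      # 88
```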
Here's an O(x) solution. It builds an index that maps (i, d) to j, the smallest number > i such that the j'th digit of n is d. With this index, one can easily find the largest possible next digit in the solution in O(1) time.
def index(digits):
    next = [len(digits)+1] * 10
    for i in range(len(digits), 0, -1):
        next[ord(digits[i-1])-ord('0')] = i-1
        yield next[::-1]

def minseq(n, y):
    n = str(n)
    idx = list(index(n))[::-1]
    i, r = 0, []
    for ry in range(len(n)-y):
        i = next(j for j in idx[i] if j <= y+ry) + 1
        r.append(n[i - 1])
    return ''.join(r)

print(minseq(7816295, 3))
print(minseq(4213, 2))
Pseudocode:
Number.toDigits().filter (sortedSet (Number.toDigits()). take (y))
Imho you don't need to know x.
For efficiency, Number.toDigits () could be precalculated
digits = Number.toDigits()
digits.filter (sortedSet (digits).take (y))
Depending on language and context, you either output the digits and are done or have to convert the result into a number again.
Working Scala-Code for example:
def toDigits (l: Long) : List [Long] = if (l < 10) l :: Nil else (toDigits (l /10)) :+ (l % 10)
val num = 734529L
val dig = toDigits (num)
dig.filter (_ > ((dig.sorted).take(2).last))
A sorted set is a set which is sorted, which means, every element is only contained once and then the resulting collection is sorted by some criteria, for example numerical ascending. => 234579.
We take two of them (23) and from that subset the last (3) and filter the number by the criteria, that the digits have to be greater than that value (3).
Your question does not explicitly say, that each digit is only contained once in the original number, but since you didn't give a criterion, which one to remove in doubt, I took it as an implicit assumption.
Other languages may of course have other expressions (x.sorted, x.toSortedSet, new SortedSet (num), ...) or lack certain classes, functions, which you would have to build on your own.
You might need to write your own filter method, which takes a predicate P and a collection C, and returns a new collection of all elements which satisfy P, P being a method which takes one T and returns a Boolean. Very useful stuff.

Maximal result of expression by putting braces - algorithm

I have expression consisting of numbers separated with plus and minus sign.
I need to get the maximal result of this expression by putting braces between the numbers.
I'm trying to get polynomial algorithm for this problem, but I need some advice or hint how to achieve it.
I've found something similar here, but I don't know how to modify it.
EDIT:
I was thinking that the idea could be similar like this
getMax(inp)
{
    if(|inp| == 1)
        return inp[1] // base case
    else
        val = 0;
        for(k=2; k < |inp|; k+=2)
            val = max{val, getMax(inp[:k]) + getMax(inp[k+1:])}
}
One strategy is to use dynamic programming to choose the best operation to perform last. This divides the expression in two parts.
If the operation is addition, you call recursively on each part to find the maximum for each part.
If the operation is subtraction, you want to find the maximum on the first part and the minimum on the second part.
Here is some non-memoized code, just to show how the recurrence works (note that i iterates only on the indices of the operations, to choose the best place to break the expression):
import re

def T(s, f1=max, f2=min):
    if len(s) == 1:
        return int(s[0])
    return f1(
        T(s[:i], f1, f2)+T(s[i+1:], f1, f2)
        if s[i]=='+' else
        T(s[:i], f1, f2)-T(s[i+1:], f2, f1)
        for i in range(1, len(s), 2))

def solve(expr):
    return T(re.split('([+-])', expr))

print(solve('1-2+1')) #0 ((1-2)+1)
print(solve('1-22-23')) #2 (1-(22-23))
Implementing bottom-up dynamic programming is a little trickier, as the ideal order to fill the table is somewhat unconventional. The easiest way is to build the DP around T(k, i), which denotes "maximum/minimum for expressions of k operands starting at the ith operand". Using Anonymous's idea of separating operators and numbers into O and N respectively, example code would be:
import re

def T(O, N):
    n1 = len(N)+1 #maximum expression length
    Tmax = [[float('-Inf')]*len(N) for _ in range(n1)]
    Tmin = [[float('+Inf')]*len(N) for _ in range(n1)]
    for i, n in enumerate(N): #only the numbers
        Tmax[1][i] = Tmin[1][i] = int(n)
    for k in range(2, n1):
        for i in range(n1-k):
            for j in range(1, k):
                if (O[i+j-1] == '+'):
                    Tmax[k][i] = max(Tmax[k][i], Tmax[j][i]+Tmax[k-j][i+j])
                    Tmin[k][i] = min(Tmin[k][i], Tmin[j][i]+Tmin[k-j][i+j])
                else:
                    Tmax[k][i] = max(Tmax[k][i], Tmax[j][i]-Tmin[k-j][i+j])
                    Tmin[k][i] = min(Tmin[k][i], Tmin[j][i]-Tmax[k-j][i+j])
    return Tmax[len(N)][0]

def solve(expr):
    A = re.split('([+-])', expr)
    return T(A[1::2], A[::2])

print(solve('1+1')) #2
print(solve('1-2+1')) #0 ((1-2)+1)
print(solve('1-22-23')) #2 (1-(22-23))
Let the operators be O[0], O[1], ..., O[K-1]. Let the numbers be N[0], N[1], ..., N[K]. (There's one more number than operator).
Let M[op, i, j] be the largest value achievable from the sub-expression starting at number i and ending at number j (inclusive, both ends) if op is +, and the smallest value if op is -.
Thus M[+, 0, K] is maximum value the whole expression can take.
M satisfies a recurrence relation:
M[+, i, i] = M[-, i, i] = N[i]
M[+, i, j] = max(M[+, i, k] O[k] M[O[k], k+1, j] for k in i...j-1)
M[-, i, j] = min(M[-, i, k] O[k] M[-O[k], k+1, j] for k in i...j-1)
Here A O[k] B means A + B or A - B depending on O[k], and -O[k] means - if O[k] is +, and + if O[k] is -.
Basically, you're trying to find the best place to split the expression to either maximise or minimise the overall result. When you consider a - operator, you switch from maximising to minimising and vice versa on the right-hand-side.
These recurrence relations can be evaluated using dynamic programming in a direct way, by building a 3 dimensional table for M of size 2 * (K+1) * (K+1), where K is the number of operators.
Overall, this algorithm is O(K^3).
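The recurrence for M can also be evaluated top-down with memoization. The following is a minimal sketch, assuming the expression contains only non-negative integers separated by + and -; the names max_expression_value and best are made up for illustration, and want_max=True plays the role of op = + in M.

```python
import re
from functools import lru_cache

def max_expression_value(expr: str) -> int:
    tokens = re.split(r'([+-])', expr)
    nums = [int(t) for t in tokens[::2]]   # N[0..K]
    ops = tokens[1::2]                     # O[0..K-1]

    @lru_cache(maxsize=None)
    def best(want_max: bool, i: int, j: int) -> int:
        """Largest (want_max) or smallest value of the sub-expression N[i..j]."""
        if i == j:
            return nums[i]
        candidates = []
        for k in range(i, j):              # O[k] is the operation performed last
            left = best(want_max, i, k)
            if ops[k] == '+':
                candidates.append(left + best(want_max, k + 1, j))
            else:
                # subtraction flips max/min on the right-hand side
                candidates.append(left - best(not want_max, k + 1, j))
        return max(candidates) if want_max else min(candidates)

    return best(True, 0, len(nums) - 1)

print(max_expression_value('1-2+1'))    # 0
print(max_expression_value('1-22-23'))  # 2
```

There are O(K^2) cached states and each one scans O(K) split points, which matches the O(K^3) bound.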

Number equal to the sum of powers of its digits

I've got another interesting programming/mathematical problem.
For a given natural number q from interval [2; 10000] find the number n
which is equal to sum of q-th powers of its digits modulo 2^64.
for example: for q=3, n=153; for q=5, n=4150.
I wasn't sure if this problem fits more to math.se or stackoverflow, but this was a programming task which my friend told me quite a long time ago. Now I remembered that and would like to know how such things can be done. How to approach this?
There are two key points,
the range of possible solutions is bounded,
any group of numbers whose digits are the same up to permutation can contain at most one solution.
Let us take a closer look at the case q = 2. If a d-digit number n is equal to the sum of the squares of its digits, then
n >= 10^(d-1) // because it's a d-digit number
n <= d*9^2 // because each digit is at most 9
and the condition 10^(d-1) <= d*81 is easily translated to d <= 3 or n < 1000. That's not many numbers to check, a brute-force for those is fast. For q = 3, the condition 10^(d-1) <= d*729 yields d <= 4, still not many numbers to check. We could find smaller bounds by analysing further, for q = 2, the sum of the squares of at most three digits is at most 243, so a solution must be less than 244. The maximal sum of squares of digits in that range is reached for 199: 1² + 9² + 9² = 163, continuing, one can easily find that a solution must be less than 100. (The only solution for q = 2 is 1.) For q = 3, the maximal sum of four cubes of digits is 4*729 = 2916, continuing, we can see that all solutions for q = 3 are less than 1000. But that sort of improvement of the bound is only useful for small exponents due to the modulus requirement. When the sum of the powers of the digits can exceed the modulus, it breaks down. Therefore I stop at finding the maximal possible number of digits.
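The digit-count bound used here can be computed mechanically. A small sketch (the function name max_digit_count is made up, and the modulus is ignored, as in the argument above):

```python
def max_digit_count(q):
    """Largest d such that 10**(d-1) <= d * 9**q."""
    d = 1
    # d + 1 digits are still possible as long as 10**d <= (d + 1) * 9**q;
    # once the inequality fails it keeps failing, so the loop can stop there
    while 10 ** d <= (d + 1) * 9 ** q:
        d += 1
    return d

print(max_digit_count(2))  # 3
print(max_digit_count(3))  # 4
```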
Now, without the modulus, for the sum of the q-th powers of the digits, the bound on the number of digits would be approximately
q - (q/20) + 1
so for larger q, the range of possible solutions obtained from that is huge.
But two points come to the rescue here, first the modulus, which limits the solution space to 2 <= n < 2^64, at most 20 digits, and second, the permutation-invariance of the (modular) digital power sum.
The permutation invariance means that we only need to construct monotonous sequences of d digits, calculate the sum of the q-th powers and check whether the number thus obtained has the correct digits.
Since the number of monotonous d-digit sequences is comparably small, a brute-force using that becomes feasible. In particular if we ignore digits not contributing to the sum (0 for all exponents, 8 for q >= 22, also 4 for q >= 32, all even digits for q >= 64).
The number of monotonous sequences of length d using s symbols is
binom(s+d-1, d)
s is for us at most 9, d <= 20, summing from d = 1 to d = 20, there are at most 10015004 sequences to consider for each exponent. That's not too much.
Still, doing that for all q under consideration amounts to a long time, but if we take into account that for q >= 64, for all even digits x^q % 2^64 == 0, we need only consider sequences composed of odd digits, and the total number of monotonous sequences of length at most 20 using 5 symbols is binom(20+5,20) - 1 = 53129. Now, that looks good.
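The two counts quoted above can be reproduced with a few lines of Python (math.comb needs Python 3.8 or later; the function name monotone_sequence_count is made up for illustration):

```python
from math import comb

def monotone_sequence_count(symbols, max_len):
    """Number of monotonous sequences of length 1..max_len over `symbols` symbols."""
    return sum(comb(symbols + d - 1, d) for d in range(1, max_len + 1))

print(monotone_sequence_count(9, 20))  # 10015004
print(monotone_sequence_count(5, 20))  # 53129
```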
Summary
We consider a function f mapping digits to natural numbers and are looking for solutions of the equation
n == (sum [f(d) | d <- digits(n)] `mod` 2^64)
where digits maps n to the list of its digits.
From f, we build a function F from lists of digits to natural numbers,
F(list) = sum [f(d) | d <- list] `mod` 2^64
Then we are looking for fixed points of G = F ∘ digits. Now n is a fixed point of G if and only if digits(n) is a fixed point of H = digits ∘ F. Hence we may equivalently look for fixed points of H.
But F is permutation-invariant, so we can restrict ourselves to sorted lists and consider K = sort ∘ digits ∘ F.
Fixed points of H and of K are in one-to-one correspondence. If list is a fixed point of H, then sort(list) is a fixed point of K, and if sortedList is a fixed point of K, then H(sortedList) is a permutation of sortedList, hence H(H(sortedList)) = H(sortedList), in other words, H(sortedList) is a fixed point of K, and sort resp. H are bijections between the set of fixed points of H and K.
A further improvement is possible if some f(d) are 0 (modulo 2^64). Let compress be a function that removes digits with f(d) mod 2^64 == 0 from a list of digits and consider the function L = compress ∘ K.
Since F ∘ compress = F, if list is a fixed point of K, then compress(list) is a fixed point of L. Conversely, if clist is a fixed point of L, then K(clist) is a fixed point of K, and compress resp. K are bijections between the sets of fixed points of L resp. K. (And H(clist) is a fixed point of H, and compress ∘ sort resp. H are bijections between the sets of fixed points of L resp. H.)
The space of compressed sorted lists of at most d digits is small enough to brute-force for the functions f under consideration, namely power functions.
So the strategy is:
Find the maximal number d of digits to consider (bounded by 20 due to the modulus, smaller for small q).
Generate the compressed monotonic sequences of up to d digits.
Check whether the sequence is a fixed point of L, if it is, F(sequence) is a fixed point of G, i.e. a solution of the problem.
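Before the full program, here is a rough Python sketch of the three steps above. It is not a translation of the code below, just an illustration under the same assumptions; the names MOD, max_digits and digit_power_fixed_points are made up, and it is only meant for small q.

```python
from itertools import combinations_with_replacement

MOD = 2 ** 64

def max_digits(q):
    """Step 1: largest d with 10**(d-1) <= d * 9**q, capped at 20 by the modulus."""
    d = 1
    while 10 ** d <= (d + 1) * 9 ** q:
        d += 1
    return min(d, 20)

def digit_power_fixed_points(q):
    powers = [pow(d, q, MOD) for d in range(10)]
    # digits whose q-th power is 0 modulo 2^64 do not contribute ("compress")
    contributing = [d for d in range(10) if powers[d] != 0]
    found = set()
    for length in range(1, max_digits(q) + 1):
        # Step 2: sorted sequences of contributing digits
        for seq in combinations_with_replacement(contributing, length):
            total = sum(powers[d] for d in seq) % MOD
            # Step 3: the contributing digits of the sum must be exactly seq
            digits = sorted(int(c) for c in str(total) if powers[int(c)] != 0)
            if digits == list(seq):
                found.add(total)
    return sorted(found)

print(digit_power_fixed_points(3))  # [1, 153, 370, 371, 407]
```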
Code
Fortunately, you haven't specified a language, so I went for the option of simplest code, i.e. Haskell:
{-# LANGUAGE CPP #-}
module Main (main) where

import Data.List
import Data.Array.Unboxed
import Data.Word
import Text.Printf

#include "MachDeps.h"

#if WORD_SIZE_IN_BITS == 64
type UINT64 = Word
#else
type UINT64 = Word64
#endif

maxDigits :: UINT64 -> Int
maxDigits mx = min 20 $ go d0 (10^(d0-1)) start
  where
    d0 = floor (log (fromIntegral mx) / log 10) + 1
    mxi :: Integer
    mxi = fromIntegral mx
    start = mxi * fromIntegral d0
    go d p10 mmx
        | p10 > mmx = d-1
        | otherwise = go (d+1) (p10*10) (mmx+mxi)

sortedDigits :: UINT64 -> [UINT64]
sortedDigits = sort . digs
  where
    digs 0 = []
    digs n = case n `quotRem` 10 of
               (q,r) -> r : digs q

generateSequences :: Int -> [a] -> [[a]]
generateSequences 0 _
    = [[]]
generateSequences d [x]
    = [replicate d x]
generateSequences d (x:xs)
    = [replicate k x ++ tl | k <- [d,d-1 .. 0], tl <- generateSequences (d-k) xs]
generateSequences _ _ = []

fixedPoints :: (UINT64 -> UINT64) -> [UINT64]
fixedPoints digFun = sort . map listNum . filter okSeq $
                        [ds | d <- [1 .. mxdigs], ds <- generateSequences d contDigs]
  where
    funArr :: UArray UINT64 UINT64
    funArr = array (0,9) [(i,digFun i) | i <- [0 .. 9]]
    mxval = maximum (elems funArr)
    contDigs = filter ((/= 0) . (funArr !)) [0 .. 9]
    mxdigs = maxDigits mxval
    listNum = sum . map (funArr !)
    numFun = listNum . sortedDigits
    listFun = inter . sortedDigits . listNum
    inter = go contDigs
      where
        go cds@(c:cs) dds@(d:ds)
            | c < d     = go cs dds
            | c == d    = c : go cds ds
            | otherwise = go cds ds
        go _ _ = []
    okSeq ds = ds == listFun ds

solve :: Int -> IO ()
solve q = do
    printf "%d:\n " q
    print (fixedPoints (^q))

main :: IO ()
main = mapM_ solve [2 .. 10000]
It's not optimised, but as is, it finds all solutions for 2 <= q <= 10000 in a little below 50 minutes on my box, starting with
2:
[1]
3:
[1,153,370,371,407]
4:
[1,1634,8208,9474]
5:
[1,4150,4151,54748,92727,93084,194979]
6:
[1,548834]
7:
[1,1741725,4210818,9800817,9926315,14459929]
8:
[1,24678050,24678051,88593477]
9:
[1,146511208,472335975,534494836,912985153]
10:
[1,4679307774]
11:
[1,32164049650,32164049651,40028394225,42678290603,44708635679,49388550606,82693916578,94204591914]
And ending with
9990:
[1,12937422361297403387,15382453639294074274]
9991:
[1,16950879977792502812]
9992:
[1,2034101383512968938]
9993:
[1]
9994:
[1,9204092726570951194,10131851145684339988]
9995:
[1]
9996:
[1,10606560191089577674,17895866689572679819]
9997:
[1,8809232686506786849]
9998:
[1]
9999:
[1]
10000:
[1,11792005616768216715]
The exponents from about 10 to 63 take longest (individually, not cumulative), there's a remarkable speedup from exponent 64 on due to the reduced search space.
Here is a brute force solution that will solve for all such n, including 1 and any other n greater than the first within whatever range you choose (in this case I chose base^q as my range limit). You could modify to ignore the special case of 1 and also to return after the first result. It's in C#, but might look nicer in a language with a ** exponentiation operator. You could also pass in your q and base as parameters.
int q = 5;
int radix = 10;
for (int input = 1; input < (int)Math.Pow(radix, q); input++)
{
    int sum = 0;
    for (int i = 1; i < (int)Math.Pow(radix, q); i *= radix)
    {
        int x = input / i % radix; //get current digit
        sum += (int)Math.Pow(x, q); //x**q;
    }
    if (sum == input)
    {
        Console.WriteLine("Hooray: {0}", input);
    }
}
So, for q = 5 the results are:
Hooray: 1
Hooray: 4150
Hooray: 4151
Hooray: 54748
Hooray: 92727
Hooray: 93084

Generate Random(a, b) making calls to Random(0, 1)

There is a known function Random(0,1); it is a uniform random function, which means it returns 0 or 1, each with probability 50%. Implement Random(a, b) so that it only makes calls to Random(0,1).
What I thought of so far is: put the range a..b in a 0-based array, so I have indices 0, 1, 2, ..., b-a.
Then call RANDOM(0,1) b-a times, sum the results to get the generated index, and return that element.
However, since there is no answer in the book, I don't know if this way is correct or the best. How can I prove that the probability of returning each element is exactly the same, namely 1/(b-a+1)?
And what is the right/better way to do this?
If your RANDOM(0, 1) returns either 0 or 1, each with probability 0.5 then you can generate bits until you have enough to represent the number (b-a+1) in binary. This gives you a random number in a slightly too large range: you can test and repeat if it fails. Something like this (in Python).
import math

def rand_pow2(bit_count):
    """Return a random number with the given number of bits."""
    result = 0
    for i in range(bit_count):
        result = 2 * result + RANDOM(0, 1)
    return result

def random_range(a, b):
    """Return a random integer in the closed interval [a, b]."""
    bit_count = math.ceil(math.log2(b - a + 1))
    while True:
        r = rand_pow2(bit_count)
        if a + r <= b:
            return a + r
When you sum random numbers, the result is no longer evenly distributed - it looks like a Gaussian function. Look up "law of large numbers" or read any probability book / article. Just like flipping a coin 100 times is highly unlikely to give 100 heads. It's likely to give close to 50 heads and 50 tails.
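To make this concrete, the exact distribution of the sum of n fair 0/1 flips is binomial, and it can be printed directly. A small sketch (the function name sum_of_flips_distribution is made up; math.comb needs Python 3.8 or later):

```python
from fractions import Fraction
from math import comb

def sum_of_flips_distribution(n):
    """Exact probability of each possible sum 0..n of n fair 0/1 flips."""
    return [Fraction(comb(n, k), 2 ** n) for k in range(n + 1)]

# For b - a = 4 the five possible indices 0..4 are far from equally likely:
print(sum_of_flips_distribution(4))
```

With n = 4 the probabilities are 1/16, 1/4, 3/8, 1/4 and 1/16, not 1/5 each.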
Your inclination to first map the range to 0..b-a is correct. However, you cannot do it as you stated. This question asks exactly how to do that, and the answer utilizes unique factorization. Write m=b-a+1 in base 2, keeping track of the largest needed exponent, say e. Then find the largest k such that k*m does not exceed 2^e. Finally, generate e bits with RANDOM(0,1) and take them as the base 2 expansion of some number x; if x < k*m, return x modulo m, otherwise try again. The program looks something like this (simple case when m<2^2):
int RANDOM(int lo, int hi); // given: RANDOM(0,1) returns 0 or 1, each with probability 0.5
// returns a uniform value in 0..m-1 (the name random_mod is only illustrative)
int random_mod(int m) {
    // find the smallest e such that 2^e >= m
    int e=0;
    while (m > (1 << e)) {
        ++e;
    }
    // find the largest k such that k*m <= 2^e
    int k=1;
    while (k*m <= (1 << e)) {
        ++k;
    }
    --k; // we went one too far
    while (1) {
        // generate a random e-bit number
        int x = 0;
        for (int i=0; i<e; ++i) {
            x = x*2 + RANDOM(0,1);
        }
        // if x isn't too large, return x modulo m
        if (x < m*k)
            return (x % m);
    }
}
Now you can simply add a to the result to get uniformly distributed numbers between a and b.
Divide and conquer could help us in generating a random number in range [a,b] using random(0,1). The idea is
if a is equal to b, then random number is a
Find mid of the range [a,b]
Generate random(0,1)
If above is 0, return a random number in range [a,mid] using recursion
else return a random number in range [mid+1, b] using recursion
The working 'C' code is as follows.
int random(int a, int b)
{
    if(a == b)
        return a;
    int c = RANDOM(0,1); // Returns 0 or 1 with probability 0.5
    int mid = a + (b-a)/2;
    if(c == 0)
        return random(a, mid);
    else
        return random(mid + 1, b);
}
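It is worth checking the exact output distribution of this recursion. The sketch below mirrors the C code with exact fractions instead of random draws (the name split_distribution is made up for illustration). Each half gets probability 1/2 regardless of its size, so the result looks uniform only when b-a+1 is a power of two.

```python
from fractions import Fraction

def split_distribution(a, b):
    """Exact probability of each value returned by the divide-and-conquer recursion."""
    if a == b:
        return {a: Fraction(1)}
    mid = a + (b - a) // 2
    dist = {}
    # each half is chosen with probability 1/2, whatever its size
    for half in (split_distribution(a, mid), split_distribution(mid + 1, b)):
        for value, prob in half.items():
            dist[value] = prob / 2
    return dist

print(split_distribution(0, 3))  # every value has probability 1/4
print(split_distribution(0, 2))  # 0 and 1 have probability 1/4, 2 has probability 1/2
```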
If you have an RNG that returns {0, 1} with equal probability, you can easily create an RNG that returns the numbers {0, ..., 2^n - 1} with equal probability.
To do this you just use your original RNG n times and get a binary number like 0010110111. Each of the numbers (from 0 to 2^n - 1) is equally likely.
Now it is easy to get an RNG from a (inclusive) to b (exclusive), where b - a = 2^n. You just take the previous RNG and add a to its result.
Now the last question is: what should you do if b-a is not 2^n?
The good thing is that you have to do almost nothing: rely on the rejection sampling technique. It tells you that if you have an RNG over a big set and need to select an element from a subset of this set, you can just keep selecting elements from the bigger set and discarding them until you get one that is in your subset.
So all you do is find b-a and the first n such that b-a <= 2^n. Then use rejection sampling until you pick a number smaller than b-a. Then you just add a.

algorithm to find closest string using same characters

Given a list L of n character strings, and an input character string S, what is an efficient way to find the character string in L that contains the most characters that exist in S? We want to find the string in L that is most-closely made up of the letters contained in S.
The obvious answer is to loop through all n strings and check to see how many characters in the current string exist in S. However, this algorithm will be run frequently, and the list L of n string will be stored in a database... loop manually through all n strings would require something like big-Oh of n*m^2, where n is the number of strings in L, and m is the max length of any string in L, as well as the max length of S... in this case m is actually a constant of 150.
Is there a better way than just a simple loop? Is there a data structure I can load the n strings into that would give me fast search ability? Is there an algorithm that uses the pre-calculated meta-data about each of the n strings that would perform better than a loop?
I know there are a lot of geeks out there that are into the algorithms. So please help!
Thanks!
If you are after substrings, a Trie or Patricia trie might be a good starting point.
If you don't care about the order, just about the number of each symbol or letter, I would calculate the histogram of all strings and then compare them with the histogram of the input.
               ABCDEFGHIJKLMNOPQRSTUVWXYZ
Hello World => ...11..1...3..2..1....1...
This will lower the costs to O(26 * m + n) plus the preprocessing once if you consider only case-insensitive latin letters.
If m is constant, you could interpret the histogram as a 26 dimensional vector on a 26 dimensional unit sphere by normalizing it. Then you could just calculate the Dot Product of two vectors yielding the cosine of the angle between the two vectors, and this value should be proportional to the similarity of the strings.
Assuming m = 3, a alphabet A = { 'U', 'V', 'W' } of size three only, and the following list of strings.
L = { "UUU", "UVW", "WUU" }
The histograms are the following.
H = { (3, 0, 0), (1, 1, 1), (2, 0, 1) }
A histogram h = (x, y, z) is normalized to h' = (x/r, y/r, z/r) with r the Euclidean norm of the histogram h - that is r = sqrt(x² + y² + z²).
H' = { (1.000, 0.000, 0.000), (0.577, 0.577, 0.577), (0.894, 0.000, 0.447) }
The input S = "VVW" has the histogram hs = (0, 2, 1) and the normalized histogram hs' = (0.000, 0.894, 0.447).
Now we can compare two histograms h1 = (a, b, c) and h2 = (x, y, z) using the Euclidean distance between them.
d(h1, h2) = sqrt((a - x)² + (b - y)² + (c - z)²)
For the example we obtain.
d((3, 0, 0), (0, 2, 1)) = 3.742
d((1, 1, 1), (0, 2, 1)) = 1.414
d((2, 0, 1), (0, 2, 1)) = 2.828
Hence "UVW" is closest to "VVW" (smaller numbers indicate higher similarity).
Using the normalized histograms h1' = (a', b', c') and h2' = (x', y', z') we can instead calculate the similarity as the dot product of both histograms.
d'(h1', h2') = a'x' + b'y' + c'z'
For the example we obtain.
d'((1.000, 0.000, 0.000), (0.000, 0.894, 0.447)) = 0.000
d'((0.577, 0.577, 0.577), (0.000, 0.894, 0.447)) = 0.774
d'((0.894, 0.000, 0.447), (0.000, 0.894, 0.447)) = 0.200
Again "UVW" is determined to be closest to "VVW" (larger numbers indicate higher similarity).
Both versions yield different numbers, but in this example they pick the same string. One could also use other norms - Manhattan distance (L1 norm) for example - which changes the numbers and, in general, may also change the ranking.
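The worked example can be reproduced with a short Python sketch. The three-letter alphabet and the strings are the ones from the example; the function names are made up for illustration.

```python
from collections import Counter
from math import sqrt

ALPHABET = "UVW"

def histogram(s):
    """Count how often each letter of ALPHABET occurs in s."""
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]

def euclidean_distance(h1, h2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def cosine_similarity(h1, h2):
    """Dot product of the two normalized histograms."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = sqrt(sum(a * a for a in h1))
    norm2 = sqrt(sum(b * b for b in h2))
    return dot / (norm1 * norm2)

L = ["UUU", "UVW", "WUU"]
S = "VVW"
hs = histogram(S)

closest = min(L, key=lambda w: euclidean_distance(histogram(w), hs))
most_similar = max(L, key=lambda w: cosine_similarity(histogram(w), hs))
print(closest, most_similar)  # UVW UVW
```

In a real setup the histograms of the strings in L would be computed once and stored, so that only the histogram of S has to be built per query.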
Sounds like you need a trie. Tries are used to search for words similar to the way a spell checker will work. So if the String S has the characters in the same order as the Strings in L then this may work for you.
If however, the order of the characters in S is not relevant - like a set of scrabble tiles and you want to search for the longest word - then this is not your solution.
What you want is a BK-Tree. It's a bit unintuitive, but very cool - and it makes it possible to search for elements within a Levenshtein (edit) distance threshold in O(log n) time.
If you care about ordering in your input strings, use them as is. If you don't you can sort the individual characters before inserting them into the BK-Tree (or querying with them).
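For reference, a BK-tree fits in a few dozen lines. The following is a minimal sketch, not a tuned implementation; the class and function names are made up, and the word list is only sample data.

```python
def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # a node is a pair (word, {distance: child node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, threshold):
        """Return sorted (distance, word) pairs with distance <= threshold."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= threshold:
                results.append((d, word))
            # by the triangle inequality only these subtrees can hold matches
            for k, child in children.items():
                if d - threshold <= k <= d + threshold:
                    stack.append(child)
        return sorted(results)

tree = BKTree(levenshtein)
for w in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
    tree.add(w)
print(tree.search("caqe", 1))  # [(1, 'cake'), (1, 'cape')]
```

To ignore character order, as suggested above, one could insert and query with ''.join(sorted(word)) instead of the word itself.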
I believe what you're looking for can be found here: Fuzzy Logic Based Search Technique
It's pretty heavy, but so is what you're asking for. It talks about word similarities, and character misplacement.
i.e:
L I N E A R T R N A S F O R M
L I N A E R T R A N S F O R M
L E N E A R T R A N S F R M
It seems to me that the order of the characters is not important in your problem, but you are searching for "near-anagrams" of the word S.
If that's so, then you can represent every word in the set L as an array of 26 integers (assuming your alphabet has 26 letters). You can represent S similarly as an array of 26 integers; now to find the best match you just run once through the set L and calculate a distance metric between the S-vector and the current L-vector, however you want to define the distance metric (e.g. euclidean / sum-of-squares or Manhattan / sum of absolute differences). This is O(n) algorithm because the vectors have constant lengths.
Here is a T-SQL function that has been working great for me, gives you the edit distance:
Example:
SELECT TOP 1 [StringValue] , edit_distance([StringValue], 'Input Value')
FROM [SomeTable]
ORDER BY edit_distance([StringValue], 'Input Value')
The Function:
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
    DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
        @cv0 varbinary(8000), @cv1 varbinary(8000)
    SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
    WHILE @j <= @s2_len
        SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
    WHILE @i <= @s1_len
    BEGIN
        SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
        WHILE @j <= @s2_len
        BEGIN
            SET @c = @c + 1
            SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
                CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
            IF @c > @c_temp SET @c = @c_temp
            SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
            IF @c > @c_temp SET @c = @c_temp
            SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
        END
        SELECT @cv1 = @cv0, @i = @i + 1
    END
    RETURN @c
END

Resources