How can I generate de Bruijn sequences iteratively? - algorithm

I am looking for a way to generate a de Bruijn sequence iteratively instead of with recursion. My goal is to generate it character by character.
I found some example code in Python for generating de Bruijn sequences and translated it into Rust. I am not yet able to comprehend this technique well enough to create my own method.
Translated into Rust:
fn gen(sequence: &mut Vec<usize>, a: &mut [usize], t: usize, p: usize, k: usize, n: usize) {
    if t > n {
        if n % p == 0 {
            for x in 1..(p + 1) {
                sequence.push(a[x])
            }
        }
    } else {
        a[t] = a[t - p];
        gen(sequence, a, t + 1, p, k, n);
        for x in (a[t - p] + 1)..k {
            a[t] = x;
            gen(sequence, a, t + 1, t, k, n);
        }
    }
}

fn de_bruijn<T: Clone>(alphabet: &[T], n: usize) -> Vec<T> {
    let k = alphabet.len();
    let mut a = vec![0; n + 1];
    let vecsize = k.checked_pow(n as u32).unwrap();
    let mut sequence = Vec::with_capacity(vecsize);
    gen(&mut sequence, &mut a, 1, 1, k, n);
    sequence.into_iter().map(|x| alphabet[x].clone()).collect()
}
However this is not able to generate iteratively - it goes through a whole mess of recursion and iteration which is impossible to untangle into a single state.

Consider this approach:
Choose the first (lexicographically smallest) representative from every necklace class
Here is Python code for generating representatives of (binary) necklaces containing d ones (it is possible to repeat for all d values). Sawada article link
Sort the representatives in lexicographic order
Make a periodic reduction for every representative (if possible): if the string is periodic, s = p^m, like 010101, choose 01
To find the period, you can use string doubling or the z-algorithm (I expect the latter is faster for compiled languages)
Concatenate the reductions
Example for n=3,k=2:
Sorted representatives: 000, 001, 011, 111
Reductions: 0, 001, 011, 1
Result: 00010111
The same essential method (with C code) is described in Jörg Arndt's book "Matters Computational", chapter 18
A similar way is mentioned in the wiki
An alternative construction involves concatenating together, in
lexicographic order, all the Lyndon words whose length divides n
You might look for an efficient way to generate the appropriate Lyndon words; a sketch using Duval's algorithm follows.
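For illustration, here is a minimal Python sketch of this construction (my code, not from the sources above): it enumerates the Lyndon words iteratively with Duval's algorithm and concatenates those whose length divides n, so no recursion is involved:

def de_bruijn_lyndon(k, n):
    """B(k, n) as the concatenation, in lexicographic order,
    of all Lyndon words over {0..k-1} whose length divides n."""
    sequence = []
    w = [0]                              # the first Lyndon word
    while w:
        if n % len(w) == 0:              # keep words whose length divides n
            sequence.extend(w)
        w = (w * (n // len(w) + 1))[:n]  # Duval step: repeat w up to length n,
        while w and w[-1] == k - 1:      # drop trailing maximal symbols,
            w.pop()
        if w:
            w[-1] += 1                   # and increment the last symbol
    return sequence

# de_bruijn_lyndon(2, 3) -> [0, 0, 0, 1, 0, 1, 1, 1], i.e. 00010111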

I am not familiar with Rust, so I programmed and tested it in Python. Since the poster translated the version in the question from a Python program, I hope it will not be a big issue.
# the following function treats list a as a
# k-adic number with n digits
# and increments this number, returning
# the index of the leftmost digit changed
def increment_a(a, k, n):
    digit = n - 1
    a[digit] += 1
    while a[digit] >= k and digit > 0:
        # a[digit] = 0
        a[digit] = a[0] + 1
        a[digit-1] += 1
        digit -= 1
    return digit
# the following function adds a to the sequence
# and takes into account that the beginning of a
# could overlap with the end of sequence;
# in that case, it just removes the overlapping digits
# from a before adding the remaining digits to sequence
def append_to_sequence(sequence, a, n):
    # here we can safely assume that a
    # does not overlap completely with sequence[-n:]
    i = -1
    for i in range(n-1, -1, -1):
        found = True
        # check if the last i digits in sequence
        # overlap with the first i digits in a
        for j in range(i):
            if a[j] != sequence[-i+j]:
                # no, they don't overlap
                found = False
                break
        if found:
            # yes they overlap, so no need to
            # continue the check with a smaller i
            break
    # now we can just append everything from
    # digit i on (digits 0 .. i-1 are swallowed)
    sequence.extend(a[i:])
    return n-i
# during the operation we have to keep track of
# the k-adic numbers a that have already occurred in
# the sequence. We store them in a set called used;
# every time we add something to the sequence
# we have to update it and add one entry for each
# digit inserted
def update_used(sequence, used, n, num_inserted):
    l = len(sequence)
    for i in range(num_inserted):
        used.add(tuple(sequence[-n-i:l-i]))
# the main work is done in the following function;
# it creates and returns the generated sequence
def gen4(k, n):
    a = [0]*n
    sequence = a[:]
    used = set()
    # create a fake sequence to add the segments obtained by the cyclic nature
    fake = [k-1] * (n-1)
    for i in range(n-1):
        fake.append(0)
    update_used(fake, used, n, n-1)
    update_used(sequence, used, n, 1)
    valid = True
    while valid:
        # a is still a valid k-adic number; this
        # means the generation process has not
        # ended, so construct a new number from
        # the n-1 last digits of sequence
        # followed by a zero
        a = sequence[-n+1:]
        a.append(0)
        while valid and tuple(a) in used:
            # the constructed k-adic number a
            # was already used, so increment it
            # and try again
            increment_a(a, k, n)
            valid = a[0] < k
        if valid:
            # great, the number is still valid
            # and is not yet part of the sequence,
            # so add it after removing the overlapping
            # digits and update the set with the segments
            # we already used
            num_inserted = append_to_sequence(sequence, a, n)
            update_used(sequence, used, n, num_inserted)
    return sequence
I tested the code above by generating some sequences with the original version of gen and this one using the same parameters. For all sets of parameters I tested, the result was the same in both versions.
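For instance (my addition), the agreement can be spot-checked against the worked example from the first answer, where B(2,3) = 00010111:

print(gen4(2, 3))
# expected, same as the recursive version: [0, 0, 0, 1, 0, 1, 1, 1]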
Please note that this code is less efficient than the original version, especially as the sequence gets long. I guess the cost of the set operations has a non-linear influence on the runtime.
If you like, you can improve it further such as by using a more efficient way to store the used segments. Instead of operating on the k-adic representation (the a-list), you could use a multidimensional array instead.

Related

Efficient partial permutation sort in Julia

I am dealing with a problem that requires a partial permutation sort by magnitude in Julia. If x is a vector of dimension p, then what I need are the first k indices corresponding to the k components of x that would appear first in a partial sort by absolute value of x.
Refer to Julia's sorting functions here. Basically, I want a cross between sortperm and select!. When Julia 0.4 is released, I will be able to obtain the same answer by applying sortperm! (this function) to the vector of indices and choosing the first k of them. However, using sortperm! is not ideal here because it will sort the remaining p-k indices of x, which I do not need.
What would be the most memory-efficient way to do the partial permutation sort? I hacked a solution by looking at the sortperm source code. However, since I am not versed in the ordering modules that Julia uses there, I am not sure if my approach is intelligent.
One important detail: I can ignore repeats or ambiguities here. In other words, I do not care about the ordering by abs() of indices for two components 2 and -2. My actual code uses floating point values, so exact equality never occurs for practical purposes.
# initialize a vector for testing
x = [-3,-2,4,1,0,-1]
x2 = copy(x)
k = 3 # num components desired in partial sort
p = 6 # num components in x, x2
# what are the indices that sort x by magnitude?
indices = sortperm(x, by = abs, rev = true)
# now perform partial sort on x2
select!(x2, k, by = abs, rev = true)
# check if first k components are sorted here
# should evaluate to "true"
isequal(x2[1:k], x[indices[1:k]])
# now try my partial permutation sort
# I only need indices2[1:k] at end of day!
indices2 = [1:p]
select!(indices2, 1:k, 1, p, Base.Perm(Base.ord(isless, abs, true, Base.Forward), x))
# same result? should evaluate to "true"
isequal(indices2[1:k], indices[1:k])
EDIT: With the suggested code, we can briefly compare performance on much larger vectors:
p = 10000; k = 100; # asking for largest 1% of components
x = randn(p); x2 = copy(x);
# run following code twice for proper timing results
@time {indices = sortperm(x, by = abs, rev = true); indices[1:k]};
@time {indices2 = [1:p]; select!(indices2, 1:k, 1, p, Base.Perm(Base.ord(isless, abs, true, Base.Forward), x))};
@time selectperm(x,k);
My output:
elapsed time: 0.048876901 seconds (19792096 bytes allocated)
elapsed time: 0.007016534 seconds (2203688 bytes allocated)
elapsed time: 0.004471847 seconds (1657808 bytes allocated)
The following version appears to be relatively space-efficient because it uses only an integer array of the same length as the input array:
function selectperm(x, k)
    if k > 1
        kk = 1:k
    else
        kk = 1
    end
    z = collect(1:length(x))
    return select!(z, kk, by = (i)->abs(x[i]), rev = true)
end
x = [-3,-2,4,1,0,-1]
k = 3 # num components desired in partial sort
print(selectperm(x, k))
The output is:
[3,1,2]
... as expected.
I'm not sure if it uses less memory than the originally-proposed solution (though I suspect the memory usage is similar) but the code may be clearer and it does produce only the first k indices whereas the original solution produced all p indices.
(Edit)
selectperm() has been edited to deal with the BoundsError that occurs if k=1 in the call to select!().

Better Algorithm to find the maximum number whose square divides K

Given a number K which is a product of two different numbers (A, B), find the maximum number (<= A and <= B) whose square divides K.
E.g. K = 54 (6*9). Both numbers are available, i.e. 6 and 9.
My approach is fairly trivial:
Take the smaller of the two numbers (6 in this case); call it A.
Square the number and divide K by it; if the division is exact, that's the number.
Else A = A-1, until A = 1.
For the given example, 3*3 = 9 divides K, and hence 3 is the answer.
Looking for a better algorithm, than the trivial solution.
Note: there are thousands of test cases, so the best possible approach is needed.
I am sure someone else will come up with a nice answer involving modulus arithmetic. Here is a naive approach...
Each of the factors can themselves be factored (though it might be an expensive operation).
Given the factors, you can then look for groups of repeated factors.
For instance, using your example:
Prime factors of 9: 3, 3
Prime factors of 6: 2, 3
All prime factors: 2, 3, 3, 3
There are two 3s, so you have your answer (the square of 3 divides 54).
Second example of 36 x 9 = 324
Prime factors of 36: 2, 2, 3, 3
Prime factors of 9: 3, 3
All prime factors: 2, 2, 3, 3, 3, 3
So you have two 2s and four 3s, which means 2x3x3 is repeated. 2x3x3 = 18, so the square of 18 divides 324.
Edit: python prototype
import math

def factors(num, dict):
    """ This finds the factors of a number recursively.
        It is not the most efficient algorithm, and I
        have not tested it a lot. You should probably
        use another one. dict is a dictionary which looks
        like {factor: occurrences, factor: occurrences, ...}
        It must contain at least {2: 0} but need not have
        any other pre-populated elements. Factors will be added
        to this dictionary as they are found.
    """
    while (num % 2 == 0):
        num //= 2
        dict[2] += 1
    i = 3
    found = False
    while (not found and (i <= int(math.sqrt(num)))):
        if (num % i == 0):
            found = True
            factors(i, dict)
            factors(num // i, dict)
        else:
            i += 2
    if (not found):
        if (num in dict.keys()):
            dict[num] += 1
        else:
            dict[num] = 1
    return 0

# MAIN ROUTINE IS HERE
n1 = 37  # first number (6 in your example)
n2 = 41  # second number (9 in your example)
dict = {2: 0}  # initialise factors (start with "no factors of 2")
factors(n1, dict)  # find the factors of n1 and add them to the dictionary
factors(n2, dict)  # find the factors of n2 and add them to the dictionary
sqfac = 1
# now find all factors repeated twice and multiply them together
for k in dict.keys():
    dict[k] //= 2
    sqfac *= k ** dict[k]
# here is the result
print(sqfac)
Answer in C++
#include <cmath>
#include <iostream>
using namespace std;

void func(int i, int j)
{
    int k = 54;
    float result = pow(i, 2) / k;
    if (static_cast<int>(result) == result)
    {
        if (i < j)
        {
            func(j, i);
        }
        else
        {
            cout << "Number is correct: " << i << endl;
        }
    }
    else
    {
        cout << "Number is wrong" << endl;
        func(j, i);
    }
}
Explanation:
First recurse, then test whether the result is a positive integer. If it is, check whether the other factor is smaller or greater; if greater, the recursive call tries the other factor, and if not, the number is correct. If the result is not a positive integer, print "Number is wrong" and make another recursive call to test j.
If I got the problem correctly, you have a rectangle of length A, width B, and area K,
and you want to convert it to a square while losing the minimum possible area.
If this is the case, then the problem with your algorithm is not the cost of iterating through multiple candidates until you get the output.
Rather, the problem is that your algorithm depends heavily on the length A and width B of the input rectangle,
while it should depend only on the area K.
For example:
Assume A = 1, B = 25.
Then K = 25 (the rect area).
Your algorithm will take the minimum value, which is A, and accept it as the answer after a single iteration, which is very fast but leads to the wrong answer: it results in a square of area 1 and wastes the remaining 24 (whatever cm or m).
The correct answer here is 5, which will never be reached by your algorithm.
So, in my solution I assume a single input K
My idea is as follows (see the Python sketch after the pseudocode):
x = sqrt(K)
if x is an integer, x is the answer
else loop from x-1 down to 1 (x--):
    if K / x^2 is an integer, x is the answer
This might take extra iterations but will guarantee an accurate answer.
Also, there might be some concern about the cost of sqrt(K),
but it is called just once, to avoid being misled by the length and width inputs.
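A Python sketch of the idea above (my code; math.isqrt needs Python 3.8+):

import math

def largest_square_divisor_root(K):
    x = math.isqrt(K)         # if K is a perfect square, x itself is the answer
    while x > 1:
        if K % (x * x) == 0:  # K / x^2 is an integer
            return x
        x -= 1
    return 1

# largest_square_divisor_root(54) -> 3, largest_square_divisor_root(25) -> 5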

Find next prime given all prior

I'm writing a recursive infinite prime number generator, and I'm almost sure I can optimize it better.
Right now, aside from a lookup table of the first dozen primes, each call to the recursive function receives a list of all previous primes.
Since it's a lazy generator, right now I'm just filtering out any number that is divisible by any of the previous primes, and taking the first unfiltered result. (The check I'm using short-circuits, so the first time a previous prime divides the current number evenly it aborts with that information.)
Right now, my performance degrades around when searching for the 4000th prime (37,813). I'm looking for ways to use the unique fact that I have a list of all prior primes, and am only searching for the next, to improve my filtering algorithm. (Most information I can find offers non-lazy sieves to find primes under a limit, or ways to find the p_n-th prime given p_(n-1), not optimizations to find p_n given the primes 2 ... p_(n-1).)
For example, I know that p_n must reside in the range (p_(n-1) + 1) ... (p_(n-1) + p_(n-2)). Right now I start my filtering of integers at p_(n-1) + 2 (since p_(n-1) + 1 can only be prime for p_(n-1) = 2, which is precomputed). But since this is a lazy generator, knowing the upper end of the range (p_(n-1) + p_(n-2)) doesn't help me filter anything.
What can I do to filter more effectively given all previous primes?
Code Sample
@doc """
Creates an infinite stream of prime numbers.

    iex> Enum.take(primes, 5)
    [2, 3, 5, 7, 11]

    iex> Enum.take_while(primes, fn(n) -> n < 25 end)
    [2, 3, 5, 7, 11, 13, 17, 19, 23]
"""
@spec primes :: Stream.t
def primes do
  Stream.unfold( [], fn primes ->
    next = next_prime(primes)
    { next, [next | primes] }
  end )
end

defp next_prime([]), do: 2
defp next_prime([2 | _]), do: 3
defp next_prime([3 | _]), do: 5
defp next_prime([5 | _]), do: 7
# ... etc

defp next_prime(primes) do
  start = Enum.first(primes) + 2
  Enum.first(
    Stream.drop_while(
      Integer.stream(from: start, step: 2),
      fn number ->
        Enum.any?(primes, fn prime ->
          rem(number, prime) == 0
        end )
      end
    )
  )
end
The primes function starts with an empty array, gets the next prime for it (2 initially), and then 1) emits it from the Stream and 2) adds it to the top of the primes stack used in the next call. (I'm sure this stack is the source of some slowdown.)
The next_prime function takes in that stack. Starting from the last known prime + 2, it creates an infinite stream of integers, drops each integer that divides evenly by any known prime from the list, and then returns the first occurrence.
This is, I suppose, something similar to a lazy incremental Eratosthenes's sieve.
You can see some basic attempts at optimization: I start checking at p_(n-1) + 2, and I step over even numbers.
I tried a more verbatim Eratosthenes's sieve by just passing the Integer.stream through each calculation, and after finding a prime, wrapping the Integer.stream in a new Stream.drop_while that filtered just multiples of that prime out. But since Streams are implemented as anonymous functions, that mutilated the call stack.
It's worth noting that I'm not assuming you need all prior primes to generate the next one. I just happen to have them around, thanks to my implementation.
For any number k you only need to try division with primes up to and including √k. This is because any prime factor larger than √k would have to be multiplied with a prime factor smaller than √k.
Proof:
√k * √k = k, so (a+√k) * √k > k for all 0 < a < (k-√k). From this it follows that (a+√k) can divide k only if there is another divisor smaller than √k.
This is commonly used to speed up finding primes tremendously.
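As a concrete sketch (my code), assuming `primes` is an ascending list containing every prime up to √k:

def is_prime_given_priors(k, primes):
    # trial division that stops as soon as p * p > k, per the argument above
    for p in primes:
        if p * p > k:
            break
        if k % p == 0:
            return False
    return k >= 2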
You don't need all prior primes, just those below the square root of your current production point are enough, when generating composites from primes by the sieve of Eratosthenes algorithm.
This greatly reduces the memory requirements. The primes are then simply those odd numbers which are not among the composites.
Each prime p produces a chain of its multiples, starting from its square, enumerated with the step of 2p (because we work only with odd numbers). These multiples, each with its step value, are stored in a dictionary, thus forming a priority queue. Only the primes up to the square root of the current candidate are present in this priority queue (the same memory requirement as that of a segmented sieve of E.).
Symbolically, the sieve of Eratosthenes is
P = {3, 5, 7, 9, ...} \ ⋃ { {p², p²+2p, p²+4p, p²+6p, ...} | p in P }
Each odd prime generates a stream of its multiples by repeated addition; all these streams merged together give us all the odd composites; and primes are all the odd numbers without the composites (and the one even prime number, 2).
In Python (can be read as an executable pseudocode, hopefully),
def postponed_sieve():                   # postponed sieve, by Will Ness,
    yield 2; yield 3;                    # https://stackoverflow.com/a/10733621/849891
    yield 5; yield 7;                    # original code David Eppstein / Alex Martelli
    D = {}                               # 2002, http://code.activestate.com/recipes/117119
    ps = (p for p in postponed_sieve())  # a separate Primes Supply:
    p = next(ps) and next(ps)            # (3) a Prime to add to dict
    q = p*p                              # (9) when its sQuare is
    c = 9                                # the next Candidate
    while True:
        if c not in D:                   # not a multiple of any prime seen so far:
            if c < q: yield c            #   a prime, or
            else:                        #   (c==q): the next prime's square:
                add(D, c + 2*p, 2*p)     #     (9+6, 6 : 15,21,27,33,...)
                p = next(ps)             #     (5)
                q = p*p                  #     (25)
        else:                            # 'c' is a composite:
            s = D.pop(c)                 #   step of increment
            add(D, c + s, s)             #   next multiple, same step
        c += 2                           # next odd candidate

def add(D, x, s):                        # make no multiple keys in Dict
    while x in D: x += s                 # increment by the given step
    D[x] = s
Once a prime is produced, it can be forgotten. A separate prime supply is taken from a separate invocation of the same generator, recursively, to maintain the dictionary. And the prime supply for that one is taken from another, recursively as well. Each needs to be supplied only up to the square root of its production point, so very few generators are needed overall (on the order of log log N generators), and their sizes are asymptotically insignificant (sqrt(N), sqrt( sqrt(N) ), etc).
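For instance (my usage example):

from itertools import islice

# take the first ten primes from the postponed sieve above
print(list(islice(postponed_sieve(), 10)))
# -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]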
I wrote a program that generates the prime numbers in order, without limit, and used it to sum the first billion primes at my blog. The algorithm uses a segmented Sieve of Eratosthenes; additional sieving primes are calculated at each segment, so the process can continue indefinitely, as long as you have space to store the sieving primes. Here's pseudocode:
function init(delta)                  # Sieve of Eratosthenes
    m, ps, qs := 0, [], []
    sieve := makeArray(2 * delta, True)
    for p from 2 to delta
        if sieve[p]
            m := m + 1; ps.insert(p)
            qs.insert(p + (p-1) / 2)
            for i from p+p to n step p
                sieve[i] := False
    return m, ps, qs, sieve

function advance(m, ps, qs, sieve, bottom, delta)
    for i from 0 to delta - 1
        sieve[i] := True
    for i from 0 to m - 1
        qs[i] := (qs[i] - delta) % ps[i]
    p := ps[0] + 2
    while p * p <= bottom + 2 * delta
        if isPrime(p)                 # trial division
            m := m + 1; ps.insert(p)
            qs.insert((p*p - bottom - 1) / 2)
        p := p + 2
    for i from 0 to m - 1
        for j from qs[i] to delta step ps[i]
            sieve[j] := False
    return m, ps, qs, sieve
Here ps is the list of sieving primes less than the current maximum and qs is the offset of the smallest multiple of the corresponding ps in the current segment. The advance function clears the bitarray, resets qs, extends ps and qs with new sieving primes, then sieves the next segment.
function genPrimes()
    bottom, i, delta := 0, 1, 50000
    m, ps, qs, sieve := init(delta)
    yield 2
    while True
        if i == delta                 # reset for next segment
            i, bottom := -1, bottom + 2 * delta
            m, ps, qs, sieve := advance(m, ps, qs, sieve, bottom, delta)
        else if sieve[i]              # found prime
            yield bottom + 2*i + 1
        i := i + 1
The segment size 2 * delta is arbitrarily set to 100000. This method requires O(sqrt(n)) space for the sieving primes plus constant space for the sieve.
It is slower but saves space to generate candidates with a wheel and test the candidates for primality.
function genPrimes()
    w, wheel := 0, [1,2,2,4,2,4,2,4,6,2,6,4,2,4,
                    6,6,2,6,4,2,6,4,6,8,4,2,4,2,4,8,6,4,6,
                    2,4,6,2,6,6,4,2,4,6,2,6,4,2,4,2,10,2,10]
    p := 2; yield p
    repeat
        p := p + wheel[w]
        if w == 51 then w := 4 else w := w + 1
        if isPrime(p) yield p
It may be useful to begin with a sieve and switch to a wheel when the sieve grows too large. Even better is to continue sieving with some fixed set of sieving primes, once the set grows too large, then report only those values bottom + 2*i + 1 that pass a primality test.
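For reference, here is a direct Python transcription of the wheel generator above (my code; the trial-division is_prime is a stand-in for the unspecified isPrime):

def is_prime(p):
    # simple trial-division stand-in for isPrime() in the pseudocode
    if p < 2:
        return False
    if p % 2 == 0:
        return p == 2
    d = 3
    while d * d <= p:
        if p % d == 0:
            return False
        d += 2
    return True

def gen_primes_wheel():
    wheel = [1,2,2,4,2,4,2,4,6,2,6,4,2,4,
             6,6,2,6,4,2,6,4,6,8,4,2,4,2,4,8,6,4,6,
             2,4,6,2,6,6,4,2,4,6,2,6,4,2,4,2,10,2,10]
    w, p = 0, 2
    yield p
    while True:
        p += wheel[w]
        w = 4 if w == 51 else w + 1   # entries 0-3 bootstrap 2->3->5->7->11,
        if is_prime(p):               # then the 48-entry cycle (4-51) repeats
            yield p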

Working with arbitrary inequalities and checking which, if any, are satisfied

Given a non-negative integer n and an arbitrary set of inequalities that are user-defined (in say an external text file), I want to determine whether n satisfies any inequality, and if so, which one(s).
Here is a points list.
n = 0: 1
n < 5: 5
n = 5: 10
If you draw a number n that's equal to 5, you get 10 points.
If n less than 5, you get 5 points.
If n is 0, you get 1 point.
The stuff left of the colon is the "condition", while the stuff on the right is the "value".
All entries will be of the form:
n1 op n2: val
In this system, equality takes precedence over inequality, so the order in which they appear will not matter in the end. The inputs are non-negative integers, though intermediate values and results may be negative. The results may not even be numbers (eg: could be strings). I have designed it so that it will only accept the most basic inequalities, to make it easier to write a parser (and to see whether this idea is feasible).
My program has two components:
a parser that will read structured input and build a data structure to store the conditions and their associated results.
a function that will take an argument (a non-negative integer) and return the result (or, as in the example, the number of points I receive)
If the list was hardcoded, that is an easy task: just use a case-when or if-else block and I'm done. But the problem isn't as easy as that.
Recall the list at the top. It can contain an arbitrary number of (in)equalities. Perhaps there's only 3 like above. Maybe there are none, or maybe there are 10, 20, 50, or even 1000000. Essentially, you can have m inequalities, for m >= 0
Given a number n and a data structure containing an arbitrary number of conditions and results, I want to be able to determine whether it satisfies any of the conditions and return the associated value. So as with the example above, if I pass in 5, the function will return 10.
The condition/value pairs are not unique in their raw form. You may have multiple instances of the same (in)equality but with different values, eg:
n = 0: 10
n = 0: 1000
n > 0: n
Notice the last entry: if n is greater than 0, then the value is just n itself (whatever you drew).
If multiple inequalities are satisfied (eg: n > 5, n > 6, n > 7), all of them should be returned. If that is not possible to do efficiently, I can return just the first one that satisfied it and ignore the rest. But I would like to be able to retrieve the entire list.
I've been thinking about this for a while and I'm thinking I should use two hash tables: the first one will store the equalities, while the second will store the inequalities.
Equality is easy enough to handle: Just grab the condition as a key and have a list of values. Then I can quickly check whether n is in the hash and grab the appropriate value.
However, for inequality, I am not sure how it will work. Does anyone have any ideas how I can solve this problem in as few computational steps as possible? It's clear that I can easily accomplish this in O(m) time: just run n through each (in)equality one by one. But what happens if this checking is done in real time? (eg: updated constantly)
For example, it is pretty clear that if I have 100 inequalities and 99 of them check for values > 100 while the other one checks for value <= 100, I shouldn't have to bother checking those 99 inequalities when I pass in 47.
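To illustrate the kind of pruning meant here (my sketch, not a complete solution): equalities can live in a dict, and the thresholds of the "<" conditions can be kept sorted so a binary search finds all satisfied ones without touching the rest:

import bisect

equals = {0: [1], 5: [10]}   # n = 0: 1 and n = 5: 10
less_than = [(5, 5)]         # n < 5: 5, kept sorted by threshold

def points(n):
    results = list(equals.get(n, []))
    # every (threshold, value) with threshold > n satisfies "n < threshold"
    i = bisect.bisect_right(less_than, (n, float('inf')))
    results.extend(value for _, value in less_than[i:])
    return results

# points(0) -> [1, 5], points(2) -> [5], points(5) -> [10]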
You may use any data structure to store the data. The parser itself is not included in the calculation because that will be pre-processed and only needs to be done once, but if it may be problematic if it takes too long to parse the data.
Since I am using Ruby, I likely have more flexible options when it comes to "messing around" with the data and how it will be interpreted.
class RuleSet
  Rule = Struct.new(:op1, :op, :op2, :result) do
    def <=>(r2)
      # Op of "=" sorts before others
      [op=="=" ? 0 : 1, op2.to_i] <=> [r2.op=="=" ? 0 : 1, r2.op2.to_i]
    end
    def matches(n)
      @op2i ||= op2.to_i
      case op
        when "=" then n == @op2i
        when "<" then n < @op2i
        when ">" then n > @op2i
      end
    end
  end

  def initialize(text)
    @rules = text.each_line.map do |line|
      Rule.new *line.split(/[\s:]+/)
    end.sort
  end

  def value_for( n )
    if rule = @rules.find{ |r| r.matches(n) }
      rule.result=="n" ? n : rule.result.to_i
    end
  end
end

set = RuleSet.new( DATA.read )
-1.upto(8) do |n|
  puts "%2i => %s" % [ n, set.value_for(n).inspect ]
end
#=> -1 => 5
#=> 0 => 1
#=> 1 => 5
#=> 2 => 5
#=> 3 => 5
#=> 4 => 5
#=> 5 => 10
#=> 6 => nil
#=> 7 => 7
#=> 8 => nil
__END__
n = 0: 1
n < 5: 5
n = 5: 10
n = 7: n
I would parse the input lines and separate them into predicate/result pairs and build a hash of callable procedures (using eval - oh noes!). The "check" function can iterate through each predicate and return the associated result when one is true:
class PointChecker
  def initialize(input)
    @predicates = Hash[input.split(/\r?\n/).map do |line|
      parts = line.split(/\s*:\s*/)
      [Proc.new {|n| eval(parts[0].sub(/=/,'=='))}, parts[1].to_i]
    end]
  end

  def check(n)
    @predicates.map { |p,r| p.call(n) ? r : nil }.compact
  end
end
Here is sample usage:
p = PointChecker.new <<__HERE__
n = 0: 1
n = 1: 2
n < 5: 5
n = 5: 10
__HERE__
p.check(0) # => [1, 5]
p.check(1) # => [2, 5]
p.check(2) # => [5]
p.check(5) # => [10]
p.check(6) # => []
Of course, there are many issues with this implementation. I'm just offering a proof-of-concept. Depending on the scope of your application you might want to build a proper parser and runtime (instead of using eval), handle input more generally/gracefully, etc.
I'm not spending a lot of time on your problem, but here's my quick thought:
Since the points list is always in the format n1 op n2: val, I'd just model the points as an array of hashes.
So first step is to parse the input point list into the data structure, an array of hashes.
Each hash would have values n1, op, n2, value
Then, for each data input you run through all of the hashes (all of the points) and handle each (determining if it matches to the input data or not).
Some tricks of the trade
Spend time in your parser handling bad input. Eg
n < = 1000 # no colon
n < : 1000 # missing n2
x < 2 : 10 # n1, n2 and val are either number or "n"
n # too short, missing :, n2, val
n < 1 : 10x # val is not a number and is not "n"
etc
Also politely handle non-numeric input data
Added
Re: n1 doesn't matter. Be careful, this could be a trick. Why wouldn't
5 < n : 30
be a valid points list item?
Re: multiple arrays of hashes, one array per operator, one hash per point list item -- sure that's fine. Since each op is handled in a specific way, handling the operators one by one is fine. But....ordering then becomes an issue:
Since you want multiple results returned from multiple matching point list items, you need to maintain the overall order of them. Thus I think one array of all the point lists would be the easiest way to do this.

Select k random elements from a list whose elements have weights

Selecting without any weights (equal probabilities) is beautifully described here.
I was wondering if there is a way to convert this approach to a weighted one.
I am also interested in other approaches.
Update: the sampling should be without replacement.
If the sampling is with replacement, you can use this algorithm (implemented here in Python):
import random

items = [(10, "low"),
         (100, "mid"),
         (890, "large")]

def weighted_sample(items, n):
    total = float(sum(w for w, v in items))
    i = 0
    w, v = items[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w, v = items[i]
        w -= x
        yield v
        n -= 1
This is O(n + m) where m is the number of items.
Why does this work? It is based on the following algorithm:
def n_random_numbers_decreasing(v, n):
    """Like reversed(sorted(v * random() for i in range(n))),
    but faster because we avoid sorting."""
    while n:
        v *= random.random() ** (1.0 / n)
        yield v
        n -= 1
The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.
This in turn works because the probability that n random numbers in 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = v·P^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.
If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.
(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)
Here's an implementation of that, plentifully commented:
import random

class Node:
    # Each node in the heap has a weight, value, and total weight.
    # The total weight, self.tw, is self.w plus the weight of any children.
    __slots__ = ['w', 'v', 'tw']
    def __init__(self, w, v, tw):
        self.w, self.v, self.tw = w, v, tw

def rws_heap(items):
    # h is the heap. It's like a binary tree that lives in an array.
    # It has a Node for each pair in `items`. h[1] is the root. Each
    # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
    # children, h[i<<1] and h[(i<<1)+1]. To get this nice simple
    # arithmetic, we have to leave h[0] vacant.
    h = [None]                          # leave h[0] vacant
    for w, v in items:
        h.append(Node(w, v, w))
    for i in range(len(h) - 1, 1, -1):  # total up the tws
        h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
    return h

def rws_heap_pop(h):
    gas = h[1].tw * random.random()     # start with a random amount of gas
    i = 1                               # start driving at the root
    while gas >= h[i].w:                # while we have enough gas to get past node i:
        gas -= h[i].w                   #   drive past node i
        i <<= 1                         #   move to first child
        if gas >= h[i].tw:              #   if we have enough gas:
            gas -= h[i].tw              #     drive past first child and descendants
            i += 1                      #     move to second child
    w = h[i].w                          # out of gas! h[i] is the selected node.
    v = h[i].v
    h[i].w = 0                          # make sure this node isn't chosen again
    while i:                            # fix up total weights
        h[i].tw -= w
        i >>= 1
    return v

def random_weighted_sample_no_replacement(items, n):
    heap = rws_heap(items)              # just make a heap...
    for i in range(n):
        yield rws_heap_pop(heap)        # and pop n items off it.
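A quick usage example (mine), reusing the items list from the with-replacement section above:

items = [(10, "low"), (100, "mid"), (890, "large")]
print(list(random_weighted_sample_no_replacement(items, 2)))
# e.g. ['large', 'mid'] -- "large" is by far the likeliest first pick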
If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):
sort the weights
compute the cumulative weights
pick a random number in [0,1]*totalWeight
find the interval in which this number falls into
select the elements with the corresponding interval
repeat k times
If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that their sum is 1 (a valid probability distribution function). A sketch of the with-replacement case follows.
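A minimal Python sketch of the with-replacement steps above (names are mine); for the without-replacement variant you would additionally remove the chosen element and rebuild the cumulative weights each round:

import bisect
import random

def roulette_pick(items, k):
    # items is a list of (weight, value) pairs; picks k values with replacement
    cumulative, total = [], 0.0
    for weight, _ in items:
        total += weight
        cumulative.append(total)                # cumulative weights
    picks = []
    for _ in range(k):
        r = random.random() * total             # random number in [0, totalWeight)
        i = bisect.bisect_right(cumulative, r)  # interval this number falls into
        picks.append(items[i][1])
    return picks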
I know this is a very old question, but I think there's a neat trick to do this in O(n) time if you apply a little math!
The exponential distribution has two very useful properties.
Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.
It is "memoryless". So if you already know the minimum, then the probability that any of the remaining elements is the 2nd-to-min is the same as the probability that if the true min were removed (and never generated), that element would have been the new min. This seems obvious, but I think because of some conditional probability issues, it might not be true of other distributions.
Using fact 1, we know that choosing a single element can be done by generating these exponential distribution samples with rate parameter equal to the weight, and then choosing the one with minimum value.
Using fact 2, we know that we don't have to re-generate the exponential samples. Instead, just generate one for each element, and take the k elements with lowest samples.
Finding the lowest k can be done in O(n). Use the Quickselect algorithm to find the k-th element, then simply take another pass through all elements and output all lower than the k-th.
A useful note: if you don't have immediate access to a library to generate exponential distribution samples, it can easily be done with -ln(rand())/weight; a sketch of the whole method follows.
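Here is that sketch in Python (my code; heapq.nsmallest is O(n log k) and stands in for the O(n) Quickselect mentioned above):

import heapq
import math
import random

def weighted_sample_exp(items, k):
    # items: list of (weight, value) pairs with positive weights.
    # One exponential draw per element (rate = weight); the k smallest
    # draws form the sample, per facts 1 and 2 above.
    draw = lambda item: -math.log(1.0 - random.random()) / item[0]
    return [v for _, v in heapq.nsmallest(k, items, key=draw)]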
I've done this in Ruby
https://github.com/fl00r/pickup
require 'pickup'
pond = {
"selmon" => 1,
"carp" => 4,
"crucian" => 3,
"herring" => 6,
"sturgeon" => 8,
"gudgeon" => 10,
"minnow" => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"
If you want to generate large arrays of random integers with replacement, you can use piecewise linear interpolation. For example, using NumPy/SciPy:
import numpy
import scipy.interpolate

def weighted_randint(weights, size=None):
    """Given an n-element vector of weights, randomly sample
    integers up to n with probabilities proportional to weights"""
    n = weights.size
    # normalize so that the weights sum to unity
    weights = weights / numpy.linalg.norm(weights, 1)
    # cumulative sum of weights
    cumulative_weights = weights.cumsum()
    # piecewise-linear interpolating function whose domain is
    # the unit interval and whose range is the integers up to n
    f = scipy.interpolate.interp1d(
            numpy.hstack((0.0, cumulative_weights)),
            numpy.arange(n + 1), kind='linear')
    return f(numpy.random.random(size=size)).astype(int)
This is not effective if you want to sample without replacement.
Here's a Go implementation from geodns:
package foo

import (
    "log"
    "math/rand"
)

type server struct {
    Weight int
    data   interface{}
}

func foo(servers []server) []server {
    // servers list is already sorted by the Weight attribute
    // number of items to pick
    max := 4
    result := make([]server, max)
    sum := 0
    for _, r := range servers {
        sum += r.Weight
    }
    for si := 0; si < max; si++ {
        n := rand.Intn(sum + 1)
        s := 0
        for i := range servers {
            s += servers[i].Weight
            if s >= n {
                log.Println("Picked record", i, servers[i])
                sum -= servers[i].Weight
                result[si] = servers[i]
                // remove the server from the list
                servers = append(servers[:i], servers[i+1:]...)
                break
            }
        }
    }
    return result
}
If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights:
import random

def weighted_choose_subset(weighted_set, count):
    """Return a random sample of count elements from a weighted set.

    weighted_set should be a sequence of tuples of the form
    (item, weight), for example: [('a', 1), ('b', 2), ('c', 3)]

    Each element from weighted_set shows up at most once in the
    result, and the relative likelihood of two particular elements
    showing up is equal to the ratio of their weights.

    This works as follows:

    1.) Line up the items along the number line from [0, the sum
    of all weights) such that each item occupies a segment of
    length equal to its weight.

    2.) Randomly pick a number "start" in the range [0, total
    weight / count).

    3.) Find all the points "start + n/count" (for all integers n
    such that the point is within our segments) and yield the set
    containing the items marked by those points.

    Note that this implementation may not return each possible
    subset. For example, with the input ([('a': 1), ('b': 1),
    ('c': 1), ('d': 1)], 2), it may only produce the sets ['a',
    'c'] and ['b', 'd'], but it will do so such that the weights
    are respected.

    This implementation only works for nonnegative integral
    weights. The highest weight in the input set must be less
    than the total weight divided by the count; otherwise it would
    be impossible to respect the weights while never returning
    that element more than once per invocation.
    """
    if count == 0:
        return []

    total_weight = 0
    max_weight = 0
    borders = []
    for item, weight in weighted_set:
        if weight < 0:
            raise RuntimeError("All weights must be positive integers")
        # Scale up weights so dividing total_weight / count doesn't truncate:
        weight *= count
        total_weight += weight
        borders.append(total_weight)
        max_weight = max(max_weight, weight)

    step = int(total_weight / count)
    if max_weight > step:
        raise RuntimeError(
            "Each weight must be less than total weight / count")

    next_stop = random.randint(0, step - 1)
    results = []
    current = 0
    for i in range(count):
        while borders[current] <= next_stop:
            current += 1
        results.append(weighted_set[current][0])
        next_stop += step
    return results
In the question you linked to, Kyle's solution would work with a trivial generalization.
Scan the list and sum the total weights. Then the probability to choose an element should be:
1 - (1 - (#needed/(weight left)))/(weight at n). After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly.
You can check that with everything having weight 1, this simplifies to Kyle's solution.
Edited: (had to rethink what twice as likely meant)
This one does exactly that in O(n) time and with no excess memory usage. I believe this is a clever and efficient solution that is easy to port to any language. The first two lines just populate sample data in Drupal.
function getNrandomGuysWithWeight($numitems){
    $q = db_query('SELECT id, weight FROM theTableWithTheData');
    $q = $q->fetchAll();
    $accum = 0;
    foreach($q as $r){
        $accum += $r->weight;
        $r->weight = $accum;
    }
    $out = array();
    while(count($out) < $numitems && count($q)){
        $n = rand(0, $accum);
        $lessaccum = NULL;
        $prevaccum = 0;
        $idxrm = 0;
        foreach($q as $i => $r){
            if(($lessaccum == NULL) && ($n <= $r->weight)){
                $out[] = $r->id;
                $lessaccum = $r->weight - $prevaccum;
                $accum -= $lessaccum;
                $idxrm = $i;
            }else if($lessaccum){
                $r->weight -= $lessaccum;
            }
            $prevaccum = $r->weight;
        }
        unset($q[$idxrm]);
    }
    return $out;
}
I am putting here a simple solution for picking 1 item; you can easily expand it for k items (Java style):

double random = Math.random();   // assumes the weights (item values) sum to 1
double sum = 0;
for (int i = 0; i < items.length; i++) {
    val = items[i];
    sum += val.getValue();
    if (sum > random) {
        selected = val;
        break;
    }
}
I have implemented an algorithm similar to Jason Orendorff's idea in Rust here. My version additionally supports bulk operations: insert and remove (when you want to remove a bunch of items given by their ids, not through the weighted selection path) from the data structure in O(m + log n) time, where m is the number of items to remove and n the number of items stored.
Sampling without replacement with recursion - an elegant and very short solution in C#

// how many ways we can choose 4 out of 60 students, so that every time we choose a different 4
class Program
{
    static void Main(string[] args)
    {
        int group = 60;
        int studentsToChoose = 4;

        Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
    }

    private static int FindNumberOfStudents(int studentsToChoose, int group)
    {
        if (studentsToChoose == group || studentsToChoose == 0)
            return 1;

        return FindNumberOfStudents(studentsToChoose, group - 1) + FindNumberOfStudents(studentsToChoose - 1, group - 1);
    }
}
I just spent a few hours trying to get behind the algorithms underlying sampling without replacement out there, and this topic is more complex than I initially thought. That's exciting! For the benefit of future readers (have a good day!) I document my insights here, including a ready-to-use function which respects the given inclusion probabilities further below. A nice and quick mathematical overview of the various methods can be found here: Tillé: Algorithms of sampling with equal or unequal probabilities. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as also noted in the document. Actually, the i-th inclusion probability can be recursively computed as follows:
def inclusion_probability(i, weights, k):
    """
    Computes the inclusion probability of the i-th element
    in a randomly sampled k-tuple using Jason's algorithm
    (see https://stackoverflow.com/a/2149533/7729124)
    """
    if k <= 0: return 0
    cum_p = 0
    for j, weight in enumerate(weights):
        # compute the probability of j being selected considering the weights
        p = weight / sum(weights)

        if i == j:
            # if this is the target element, we don't have to go deeper,
            # since we know that i is included
            cum_p += p
        else:
            # if this is not the target element, then we compute the conditional
            # inclusion probability of i under the constraint that j is included
            cond_i = i if i < j else i-1
            cond_weights = weights[:j] + weights[j+1:]
            cond_p = inclusion_probability(cond_i, cond_weights, k-1)

            cum_p += p * cond_p
    return cum_p
And we can check the validity of the function above by comparing
In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
0 0.41666666666666663
1 0.7333333333333333
2 0.85
to
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})
One way - also suggested in the document above - to specify the inclusion probabilities is to compute the weights from them. The whole complexity of the question at hand stems from the fact that one cannot do that directly, since one basically has to invert the recursion formula; symbolically I claim this is impossible. Numerically it can be done using all kinds of methods, e.g. Newton's method. However, the complexity of inverting the Jacobian using plain Python becomes unbearable quickly; I really recommend looking into numpy.random.choice in this case (a usage sketch follows).
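For example (my sketch of that route):

import numpy as np

weights = np.array([1.0, 2.0, 3.0, 4.0])
p = weights / weights.sum()    # probabilities must sum to 1
# draw 2 distinct indices, weighted; replace=False samples without replacement
picked = np.random.choice(len(weights), size=2, replace=False, p=p)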
Luckily there is a method using plain Python which might or might not be sufficiently performant for your purposes; it works great if there aren't that many different weights. You can find the algorithm on pages 75 & 76. It works by splitting up the sampling process into parts with the same inclusion probabilities, i.e. we can use random.sample again! I am not going to explain the principle here since the basics are nicely presented on page 69. Here is the code with hopefully a sufficient amount of comments:
def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
    """
    Returns a random sample of k elements from items, where items is a list of
    tuples (weight, element). The inclusion probability of an element in the
    final sample is given by
       k * weight / sum(weights).

    Note that the function raises if an inclusion probability cannot be
    satisfied, e.g. the following call is obviously illegal:
       sample_no_replacement_exact([(1,'a'),(2,'b')],2)

    Since selecting two elements means selecting both all the time,
    'b' cannot be selected twice as often as 'a'. In general it can be hard to
    spot if the weights are illegal and the function does *not* always raise
    an exception in that case. To remedy the situation you can pass
    best_effort=True which redistributes the inclusion probability mass
    if necessary. Note that the inclusion probabilities will change
    if deemed necessary.

    The algorithm is based on the splitting procedure on page 75/76 in:
    http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
    Additional information can be found here:
    https://stackoverflow.com/questions/2140787/

    :param items: list of tuples of type weight,element
    :param k: length of resulting sample
    :param best_effort: fix inclusion probabilities if necessary,
                        (optional, defaults to False)
    :param random_: random module to use (optional, defaults to the
                    standard random module)
    :param ε: fuzziness parameter when testing for zero in the context
              of floating point arithmetic (optional, defaults to 1e-9)
    :return: random sample set of size k
    :exception: throws ValueError in case of bad parameters,
                throws AssertionError in case of algorithmic impossibilities
    """
    # random_ defaults to the random submodule
    if not random_:
        random_ = random

    # special case empty return set
    if k <= 0:
        return set()

    if k > len(items):
        raise ValueError("resulting tuple length exceeds number of elements (k > n)")

    # sort items by weight
    items = sorted(items, key=lambda item: item[0])

    # extract the weights and elements
    weights, elements = list(zip(*items))

    # compute the inclusion probabilities (short: π) of the elements
    scaling_factor = k / sum(weights)
    π = [scaling_factor * weight for weight in weights]

    # in case of best_effort: if an inclusion probability exceeds 1,
    # try to rebalance the probabilities such that:
    # a) no probability exceeds 1,
    # b) the probabilities still sum to k, and
    # c) the probability masses flow from top to bottom:
    #    [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
    # (remember that π is sorted)
    if best_effort and π[-1] > 1 + ε:
        # probability mass we still have to distribute
        debt = 0.
        for i in reversed(range(len(π))):
            if π[i] > 1.:
                # an 'offender', take away excess
                debt += π[i] - 1.
                π[i] = 1.
            else:
                # case π[i] < 1, i.e. 'safe' element
                # maximum we can transfer from debt to π[i] and still not
                # exceed 1 is computed by the minimum of:
                # a) 1 - π[i], and
                # b) debt
                max_transfer = min(debt, 1. - π[i])
                debt -= max_transfer
                π[i] += max_transfer
        assert debt < ε, "best effort rebalancing failed (impossible)"

    # make sure we are talking about probabilities
    if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
        raise ValueError("inclusion probabilities not satisfiable: {}"
                         .format(list(zip(π, elements))))

    # special case equal probabilities
    # (up to fuzziness parameter, remember that π is sorted)
    if π[-1] < π[0] + ε:
        return set(random_.sample(elements, k))

    # compute the two possible lambda values, see formula 7 on page 75
    # (remember that π is sorted)
    λ1 = π[0] * len(π) / k
    λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
    λ = min(λ1, λ2)

    # there are two cases now, see also page 69
    # CASE 1
    # with probability λ we are in the equal probability case
    # where all elements have the same inclusion probability
    if random_.random() < λ:
        return set(random_.sample(elements, k))

    # CASE 2:
    # with probability 1-λ we are in the case of a new sample-without-
    # replacement problem which is strictly simpler;
    # it has the following new probabilities (see page 75, π^{(2)}):
    new_π = [
        (π_i - λ * k / len(π))
        /
        (1 - λ)
        for π_i in π
    ]
    new_items = list(zip(new_π, elements))

    # the first few probabilities might be 0, remove them
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    while new_items and new_items[0][0] < ε:
        new_items = new_items[1:]

    # the last few probabilities might be 1, remove them and mark them as selected
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    selected_elements = set()
    while new_items and new_items[-1][0] > 1 - ε:
        selected_elements.add(new_items[-1][1])
        new_items = new_items[:-1]

    # the algorithm reduces the length of the sample problem;
    # it is guaranteed that:
    # if λ = λ1: the first item has probability 0
    # if λ = λ2: the last item has probability 1
    assert len(new_items) < len(items), "problem was not simplified (impossible)"

    # recursive call with the simpler sample problem
    # NOTE: we have to make sure that the selected elements are included
    return sample_no_replacement_exact(
        new_items,
        k - len(selected_elements),
        best_effort=best_effort,
        random_=random_,
        ε=ε
    ) | selected_elements
Example:
In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
Out: {'b', 'c'}
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})
The weights sum up to 10, hence the inclusion probabilities compute to: a → 20%, b → 40%, c → 60%, d → 80%. (Sum: 200% = k.) It works!
Just one word of caution for the productive use of this function: it can be very hard to spot illegal inputs for the weights. An obvious illegal example is
In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]
b cannot appear twice as often as a, since both always have to be selected. There are more subtle examples. To avoid an exception in production just use best_effort=True, which rebalances the inclusion probability mass such that we always have a valid distribution. Obviously this might change the inclusion probabilities.
I used an associative map (weight, object). For example:
{
(10,"low"),
(100,"mid"),
(10000,"large")
}
total=10110
Pick a random number between 0 and total and iterate over the keys until this number fits in a given range; a sketch is below.
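A small Python sketch of that lookup (my code, mirroring the map above):

import random

weighted = {10: "low", 100: "mid", 10000: "large"}  # (weight, object) pairs
total = sum(weighted)                               # 10110

def pick():
    r = random.uniform(0, total)
    upto = 0
    for weight, obj in weighted.items():
        upto += weight
        if r <= upto:          # r fits in this key's range
            return obj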
