is Valid Golomb Ruler - algorithm

I want to write a function which can test whether a list of numbers is a valid Golomb ruler or not. I know how to do it in O(n^2) time (using nested for loops) but I am looking for a simple and more optimized way of doing it. I am trying to do it in python and my function takes a list of integers as argument.
A Golomb ruler is a set of marks {a_1 < a_2 < ... < a_m} at non-negative integer positions such that all pairwise differences a_j - a_i (for i < j) are distinct (Wikipedia).

If this is 3SUM-hard, then no one knows how to do much better than quadratic. I would be amazed if this weren't 3SUM-hard, but the hypothetical reduction looks as though it would be rather technical.
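
For reference, the straightforward quadratic check is already quite short if you keep a set of the differences seen so far; a minimal illustrative sketch (not the asker's code), assuming the marks are given in increasing order:

def is_golomb_ruler_quadratic(marks):
    # Collect every pairwise difference; fail as soon as one repeats.
    seen = set()
    for i in range(len(marks)):
        for j in range(i + 1, len(marks)):
            d = marks[j] - marks[i]
            if d == 0 or d in seen:
                return False
            seen.add(d)
    return True

print(is_golomb_ruler_quadratic([0, 1, 4, 9, 11]))  # True, a perfect ruler of order 5
print(is_golomb_ruler_quadratic([0, 1, 2, 4]))      # False, the difference 1 repeats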

Here is my code in Python, which checks whether a list of integers forms a Golomb ruler or not:
def is_golomb_ruler(ruler: list[int]) -> bool:
    marks_count = len(ruler)
    # check if first element is not 0
    if ruler[0] != 0:
        return False
    # check if contains negative values
    for item in ruler:
        if item < 0:
            return False
    # check differences table
    tab_diff = []
    for i in range(marks_count - 1):
        d = ruler[i + 1] - ruler[i]
        if d in tab_diff:
            return False
        tab_diff.append(d)
    index = marks_count - 2
    diff_len = 2
    for i in range(index, 0, -1):
        for j in range(i):
            d = 0
            for k in range(diff_len):
                d += tab_diff[k + j]
            if d in tab_diff:
                return False
            tab_diff.append(d)
        diff_len += 1
    # case of two identical marks
    if 0 in tab_diff:
        return False
    return True
You can find my repo for the complete Golomb problem, solved with metaheuristics (simulated annealing and genetic algorithms), here:
GolombProblem_Metaheuristic_Project

Related

JAVA - Recursive function to determine all prime factors of n

What I have problems understanding is how can I write a recursive method that adds to the array, if I can only have n as parameter to my function.
You have two cases: base case and recursion case. For your problem, the high-level logic looks like this:
if n is prime
    return array(n) // return a one-element array
else {
    find a prime divisor, p
    // return an array of p and the factorisation of n/p
    return array(p, FACTORIZATION(n/p) )
}
Does that get you moving? You'll need to know how to make and append to arrays in your chosen language, but those are implementation details.
It would look either like:
def factorize(n):
    factors = list()
    found = False
    t = 2
    while t*t <= n and not found:
        while (n % t) == 0:
            # n is divisible by t
            factors.append(t)
            found = True
            n //= t
        t += 1
    if found:
        factors.extend(factorize(n))
    elif n > 1:
        # append the remaining prime; the n > 1 guard avoids a trailing 1
        factors.append(n)
    return factors

factorize(3*5*5*7*7*31*101)
# --> [3, 5, 5, 7, 7, 31, 101]
This is a naive approach, to keep it simple. Or you can allow some more (named) arguments to your recursive function, which would also allow passing a list. Like:
def factorize2(n, result=None, t=2):
    if result:
        factors = result
    else:
        factors = list()
    found = False
    while t*t <= n and not found:
        while (n % t) == 0:
            factors.append(t)
            found = True
            n //= t
        t += 1
    if found:
        # continue from t: every smaller candidate has already been tried
        factorize2(n, factors, t)
    elif n > 1:
        factors.append(n)
    return factors
The basic difference is that here you reuse the list the top-level function created. This way you might give the garbage collector a little less work (though in the case of a factorization function this probably doesn't make much of a difference; in other cases I think it does). The second point is that you have already tested some factors and don't have to retest them. This is why I pass t.
Of course this is still naive. You can easily improve the performance by avoiding the t*t <= n check in each iteration and by only testing candidates t that are congruent to -1 or +1 mod 6, and so on.
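A rough sketch of the 6k ± 1 idea (illustrative only, not the answer's code): handle 2 and 3 once, then only try divisors of the form 6k - 1 and 6k + 1:

def factorize_wheel(n):
    # Trial division that only tries 2, 3 and then numbers of the form 6k +/- 1.
    factors = []
    for p in (2, 3):
        while n % p == 0:
            factors.append(p)
            n //= p
    t = 5
    while t * t <= n:
        for cand in (t, t + 2):      # 6k - 1 and 6k + 1
            while n % cand == 0:
                factors.append(cand)
                n //= cand
        t += 6
    if n > 1:
        factors.append(n)
    return factors

print(factorize_wheel(3*5*5*7*7*31*101))  # [3, 5, 5, 7, 7, 31, 101]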
Another approach to have in your toolbox is to not return an array, but rather a linked list. That is a data structure where each piece of data links to the next, which links to the next, and so on. Factorization doesn't really show the power of it, but here it is anyway:
def factorize(n, start=2):
    i = start
    while i < n:
        if n % i == 0:
            return [i, factorize(n//i, i)]
        elif n < i*i:
            break
        i = i + 1
    if 1 < i:
        return [n, None]
    else:
        return None

print(factorize(3*5*5*7*7*31*101)) # [3, [5, [5, [7, [7, [31, [101, None]]]]]]]
The win with this approach is that it does not modify the returned data structure. So if you're doing something like searching for an optimal path through a graph, you can track multiple next moves without conflict. I find this particularly useful when modifying dynamic programming algorithms to actually find the best solution, rather than to report how good it is.
The one difficulty is that you wind up with a nested data structure. But you can always flatten it as follows:
def flatten_linked_list(ll):
    answer = []
    while ll is not None:
        answer.append(ll[0])
        ll = ll[1]
    return answer
# prints [3, 5, 5, 7, 7, 31, 101]
print(flatten_linked_list( factorize(3*5*5*7*7*31*101) ))
Here are two recursive methods. The first has one parameter and uses a while loop to find a divisor, then uses divide and conquer to recursively query the factors for each of the two results of that division. (This method is more of an exercise since we can easily do this much more efficiently.) The second relies on a second parameter as a pointer to the current prime factor, which allows for a much more efficient direct enumeration.
JavaScript code:
function f(n){
  let a = ~~Math.sqrt(n)
  while (n % a)
    a--
  if (a < 2)
    return n == 1 ? [] : [n]
  let b = n / a
  let [fa, fb] = [f(a), f(b)]
  return (fa.length > 1 ? fa : [a]).concat(
    fb.length > 1 ? fb : [b])
}

function g(n, d=2){
  if (n % d && d*d < n)
    return g(n, d == 2 ? 3 : d + 2)
  else if (d*d <= n)
    return [d].concat(g(n / d, d))
  return n == 1 ? [] : [n]
}

console.log(f(567))
console.log(g(12345))

Probabilistic Sieve of Eratosthenes

Consider the following algorithm.
function Rand():
    return a uniformly random real between 0.0 and 1.0

function Sieve(n):
    assert(n >= 2)
    for i = 2 to n
        X[i] = true
    for i = 2 to n
        if (X[i])
            for j = i+1 to n
                if (Rand() < 1/i)
                    X[j] = false
    return X[n]
What is the probability that Sieve(k) returns true as a function of k ?
Let's define a series of random variables recursively:
Let X_{k,r} denote the indicator variable taking value 1 iff X[k] == true at the end of the iteration in which the loop variable i took value r.
To keep fewer symbols, and since it reads more naturally against the code, we will just write X_{k,i}; this notation would have been confusing in the definition itself, where "i taking value i" would mix up the loop variable and its value.
Now we note that:
P(X_{k,i} ~ 0) = P(X_{k,i-1} ~ 0) + P(X_{k,i-1} ~ 1) * P(X_{k-1,i-1} ~ 1) * 1/i
(~ is used in place of = just to keep it readable, since = would otherwise take two separate meanings and look confusing).
This equality holds because X[k] can be false at the end of iteration i for one of two mutually exclusive reasons: either it was already false at the end of iteration i-1, or it was still true at that point but X[k-1] was true, so we entered the inner loop and set X[k] to false with probability 1/i.
The base of the recursion is simply the fact that P(X_{k,1} ~ 1) = 1 and P(X_{2,i} ~ 1) = 1.
Lastly, we note simply that P(X[k] == true) = P(X_{k,k-1} ~ 1).
This can be programmed rather easily. Here's a JavaScript implementation that employs memoisation (you could benchmark whether nested indices beat string concatenation for the dictionary index; you could also redesign the calculation to keep the same runtime complexity but avoid running out of stack by building bottom-up instead of top-down). Naturally the implementation has a runtime complexity of O(k^2), so it's not practical for arbitrarily large numbers:
function P(k) {
    if (k < 2 || k !== Math.round(k)) return -1;
    var _ = {};
    function _P(n, i) {
        if (n === 2) return 1;
        if (i === 1) return 1;
        var $ = n + '_' + i;
        if ($ in _) return _[$];
        return _[$] = 1 - (1 - _P(n, i-1) + _P(n, i-1) * _P(n-1, i-1) * 1/i);
    }
    return _P(k, k-1);
}

P(1000); // 0.12274162882390949
More interesting would be how the 1/i probability changes things, i.e. whether the probability converges to 0 or to some other value, and if so, how changing the 1/i affects that.
Of course if you ask on Math.SE you might get a better answer - this answer is pretty simplistic; I'm sure there is a way to manipulate it to obtain a direct formula.
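
As a sanity check, the recurrence can be compared against a direct simulation of the pseudocode; here is a small illustrative Python sketch (the trial count is chosen arbitrarily):

import random

def sieve_returns_true(n):
    # Direct simulation of Sieve(n) from the question.
    x = [True] * (n + 1)
    for i in range(2, n + 1):
        if x[i]:
            for j in range(i + 1, n + 1):
                if random.random() < 1.0 / i:
                    x[j] = False
    return x[n]

def estimate(k, trials=10000):
    return sum(sieve_returns_true(k) for _ in range(trials)) / trials

print(estimate(50))  # should be close to the memoised P(50) computed above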

Find the minimum number of operations required to compute a number using a specified range of numbers

Let me start with an example -
I have a range of numbers from 1 to 9. And let's say the target number that I want is 29.
In this case the minimum number of operations required would be 2: (9*3)+2 = 29. Similarly, for 18 the minimum number of operations is 1 (9*2 = 18).
I can use any of the 4 arithmetic operators - +, -, / and *.
How can I programmatically find out the minimum number of operations required?
Thanks in advance for any help provided.
clarification: integers only, no decimals allowed mid-calculation. i.e. the following is not valid (from comments below): ((9/2) + 1) * 4 == 22
I must admit I didn't think about this thoroughly, but for my purpose it doesn't matter if decimal numbers appear mid-calculation. ((9/2) + 1) * 4 == 22 is valid. Sorry for the confusion.
For the special case where set Y = [1..9] and n > 0:
n <= 9 : 0 operations
n <=18 : 1 operation (+)
otherwise : Remove any divisor found in Y. If this is not enough, do a recursion on the remainder for all offsets -9 .. +9. Offset 0 can be skipped as it has already been tried.
Notice how division is not needed in this case. For other Y this does not hold.
This algorithm is exponential in log(n). The exact analysis is a job for somebody with more knowledge about algebra than I.
For more speed, add pruning to eliminate some of the search for larger numbers.
Sample code:
def findop(n, maxlen=9999):
    # Return a short postfix list of numbers and operations
    # Simple solution to small numbers
    if n <= 9: return [n]
    if n <= 18: return [9, n-9, '+']
    # Find direct multiply
    x = divlist(n)
    if len(x) > 1:
        mults = len(x) - 1
        x[-1:] = findop(x[-1], maxlen - 2*mults)
        x.extend(['*'] * mults)
        return x
    shortest = 0
    for o in range(1, 10) + range(-1, -10, -1):
        x = divlist(n - o)
        if len(x) == 1: continue
        mults = len(x) - 1
        # We spent len(divlist) + mults + 2 fields for offset.
        # The last number is expanded by the recursion, so it doesn't count.
        recursion_maxlen = maxlen - len(x) - mults - 2 + 1
        if recursion_maxlen < 1: continue
        x[-1:] = findop(x[-1], recursion_maxlen)
        x.extend(['*'] * mults)
        if o > 0:
            x.extend([o, '+'])
        else:
            x.extend([-o, '-'])
        if shortest == 0 or len(x) < shortest:
            shortest = len(x)
            maxlen = shortest - 1
            solution = x[:]
    if shortest == 0:
        # Fake solution, it will be discarded
        return '#' * (maxlen + 1)
    return solution

def divlist(n):
    l = []
    for d in range(9, 1, -1):
        while n % d == 0:
            l.append(d)
            n = n / d
    if n > 1: l.append(n)
    return l
The basic idea is to test all possibilities with k operations, for k starting from 0. Imagine you create a tree of height k that branches for every possible new operation with operand (4*9 branches per level). You need to traverse and evaluate the leaves of the tree for each k before moving to the next k.
I didn't test this pseudo-code:
for every k from 0 to infinity
    for every n from 1 to 9
        if compute(n,0,k):
            return k

boolean compute(n,j,k):
    if (j == k):
        return (n == target)
    else:
        for each operator in {+,-,*,/}:
            for every i from 1 to 9:
                if compute((n operator i),j+1,k):
                    return true
        return false
It doesn't take arithmetic operator precedence or brackets into account; that would require some rework.
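For reference, a runnable Python version of this iterative-deepening idea might look like the sketch below (illustrative only: it evaluates strictly left to right, and it only allows divisions that come out exact, which matches the original "integers only" clarification rather than the later update):

def min_operations(target, digits=range(1, 10), max_ops=5):
    # Depth-limited search: can we reach `target` from `value` in at most `ops_left` steps?
    def search(value, ops_left):
        if value == target:
            return True
        if ops_left == 0:
            return False
        for d in digits:
            if search(value + d, ops_left - 1): return True
            if search(value - d, ops_left - 1): return True
            if search(value * d, ops_left - 1): return True
            if value % d == 0 and search(value // d, ops_left - 1): return True
        return False

    # Iterative deepening: try 0 operations, then 1, and so on (cost grows exponentially in k).
    for k in range(max_ops + 1):
        if any(search(d, k) for d in digits):
            return k
    return None

print(min_operations(29))  # 2, e.g. 9*3 + 2
print(min_operations(18))  # 1, e.g. 9*2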
Really cool question :)
Notice that you can start from the end! From your example (9*3)+2 = 29 is equivalent to saying (29-2)/3=9. That way we can avoid the double loop in cyborg's answer. This suggests the following algorithm for set Y and result r:
nextleaves = {r}
nops = 0
while(true):
    nops = nops+1
    leaves = nextleaves
    nextleaves = {}
    for leaf in leaves:
        for y in Y:
            if (leaf+y) or (leaf-y) or (leaf*y) or (leaf/y) is in Y:
                return(nops)
            else:
                add (leaf+y) and (leaf-y) and (leaf*y) and (leaf/y) to nextleaves
This is the basic idea; performance can certainly be improved, for instance by avoiding "backtracks" such as r+a-a or r*a*b/a.
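A small runnable Python sketch of that backward search (illustrative; it uses Fraction so intermediate divisions are exact, and caps the depth so the loop always terminates):

from fractions import Fraction

def min_ops_backwards(r, Y=range(1, 10), max_ops=5):
    # Breadth-first search backwards from the result r; each level undoes one operation.
    if r in Y:
        return 0
    leaves = {Fraction(r)}
    for nops in range(1, max_ops + 1):
        nextleaves = set()
        for leaf in leaves:
            for y in Y:
                for c in (leaf + y, leaf - y, leaf * y, leaf / y):
                    if c.denominator == 1 and int(c) in Y:
                        return nops
                    nextleaves.add(c)
        leaves = nextleaves
    return None

print(min_ops_backwards(29))  # 2, since (29-2)/3 = 9
print(min_ops_backwards(18))  # 1, since 18/2 = 9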
I guess my idea is similar to the one of Peer Sommerlund:
For big numbers you advance fast by multiplying with big digits.
Is the target (29, say) prime? If not, divide it by its largest divisor in the range 2 to 9.
Otherwise, subtract a number to reach a divisible one. 27 works, since it is divisible by 9, so
(29-2)/9 = 3 =>
3*9+2 = 29
So maybe - I didn't think this through to the end - search for the next number below the target that is divisible by 9. If you don't reach a single-digit number, repeat.
The formula is the steps reversed.
(I'll try it for some numbers. :) )
I tried with 2551, which is
echo $((((3*9+4)*9+4)*9+4))
But I didn't test every intermediate result whether it is prime.
But
echo $((8*8*8*5-9))
is 2 operations less. Maybe I can investigate this later.

Select k random elements from a list whose elements have weights

Selecting without any weights (equal probabilities) is beautifully described here.
I was wondering if there is a way to convert this approach to a weighted one.
I am also interested in other approaches as well.
Update: Sampling without replacement
If the sampling is with replacement, you can use this algorithm (implemented here in Python):
import random

items = [(10, "low"),
         (100, "mid"),
         (890, "large")]

def weighted_sample(items, n):
    total = float(sum(w for w, v in items))
    i = 0
    w, v = items[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w, v = items[i]
        w -= x
        yield v
        n -= 1
This is O(n + m) where m is the number of items.
Why does this work? It is based on the following algorithm:
def n_random_numbers_decreasing(v, n):
    """Like reversed(sorted(v * random() for i in range(n))),
    but faster because we avoid sorting."""
    while n:
        v *= random.random() ** (1.0 / n)
        yield v
        n -= 1
The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.
This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = v * P^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.
If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.
(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)
Here's an implementation of that, plentifully commented:
import random

class Node:
    # Each node in the heap has a weight, value, and total weight.
    # The total weight, self.tw, is self.w plus the weight of any children.
    __slots__ = ['w', 'v', 'tw']
    def __init__(self, w, v, tw):
        self.w, self.v, self.tw = w, v, tw

def rws_heap(items):
    # h is the heap. It's like a binary tree that lives in an array.
    # It has a Node for each pair in `items`. h[1] is the root. Each
    # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
    # children, h[i<<1] and h[(i<<1)+1]. To get this nice simple
    # arithmetic, we have to leave h[0] vacant.
    h = [None]                          # leave h[0] vacant
    for w, v in items:
        h.append(Node(w, v, w))
    for i in range(len(h) - 1, 1, -1):  # total up the tws
        h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
    return h

def rws_heap_pop(h):
    gas = h[1].tw * random.random()     # start with a random amount of gas

    i = 1                   # start driving at the root
    while gas >= h[i].w:    # while we have enough gas to get past node i:
        gas -= h[i].w       #   drive past node i
        i <<= 1             #   move to first child
        if gas >= h[i].tw:  #   if we have enough gas:
            gas -= h[i].tw  #     drive past first child and descendants
            i += 1          #     move to second child
    w = h[i].w              # out of gas! h[i] is the selected node.
    v = h[i].v

    h[i].w = 0              # make sure this node isn't chosen again
    while i:                # fix up total weights
        h[i].tw -= w
        i >>= 1
    return v

def random_weighted_sample_no_replacement(items, n):
    heap = rws_heap(items)              # just make a heap...
    for i in range(n):
        yield rws_heap_pop(heap)        # and pop n items off it.
If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):
sort the weights
compute the cumulative weights
pick a random number in [0,1]*totalWeight
find the interval in which this number falls into
select the elements with the corresponding interval
repeat k times
If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that they sum to 1 (a valid probability distribution).
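A minimal Python sketch of the roulette-wheel idea (illustrative only): build the cumulative weights, draw a number in [0, total), and binary-search for the interval it falls into; for the without-replacement variant, drop the chosen element and rebuild the cumulative sums, which is equivalent to re-normalizing:

import bisect
import random

def roulette_sample(items, k, replace=True):
    # items: list of (weight, value) pairs with positive weights
    items = list(items)
    picked = []
    for _ in range(k):
        cumulative, total = [], 0
        for w, _v in items:
            total += w
            cumulative.append(total)
        r = random.uniform(0, total)
        i = bisect.bisect_left(cumulative, r)   # find the interval r falls into
        picked.append(items[i][1])
        if not replace:
            del items[i]
    return picked

print(roulette_sample([(10, "low"), (100, "mid"), (890, "large")], 2, replace=False))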
I know this is a very old question, but I think there's a neat trick to do this in O(n) time if you apply a little math!
The exponential distribution has two very useful properties.
Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.
It is "memoryless". So if you already know the minimum, then the probability that any of the remaining elements is the 2nd-to-min is the same as the probability that if the true min were removed (and never generated), that element would have been the new min. This seems obvious, but I think because of some conditional probability issues, it might not be true of other distributions.
Using fact 1, we know that choosing a single element can be done by generating these exponential distribution samples with rate parameter equal to the weight, and then choosing the one with minimum value.
Using fact 2, we know that we don't have to re-generate the exponential samples. Instead, just generate one for each element, and take the k elements with lowest samples.
Finding the lowest k can be done in O(n). Use the Quickselect algorithm to find the k-th element, then simply take another pass through all elements and output all lower than the k-th.
A useful note: if you don't have immediate access to a library to generate exponential distribution samples, it can be easily done by: -ln(rand())/weight
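Putting this answer's pieces together, a short illustrative Python sketch (it uses the standard library's expovariate, which is exactly the -ln(rand())/weight recipe, and a heap instead of quickselect for brevity):

import heapq
import random

def weighted_sample_without_replacement(items, k):
    # items: list of (weight, value) pairs with strictly positive weights.
    # Key each item by an exponential sample with rate = weight and keep
    # the k items with the smallest keys.
    keyed = [(random.expovariate(w), v) for w, v in items]
    return [v for _key, v in heapq.nsmallest(k, keyed)]

print(weighted_sample_without_replacement(
    [(10, "low"), (100, "mid"), (890, "large")], 2))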
I've done this in Ruby
https://github.com/fl00r/pickup
require 'pickup'
pond = {
  "selmon"   => 1,
  "carp"     => 4,
  "crucian"  => 3,
  "herring"  => 6,
  "sturgeon" => 8,
  "gudgeon"  => 10,
  "minnow"   => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"
If you want to generate large arrays of random integers with replacement, you can use piecewise linear interpolation. For example, using NumPy/SciPy:
import numpy
import scipy.interpolate
def weighted_randint(weights, size=None):
    """Given an n-element vector of weights, randomly sample
    integers up to n with probabilities proportional to weights"""
    n = weights.size
    # normalize so that the weights sum to unity
    weights = weights / numpy.linalg.norm(weights, 1)
    # cumulative sum of weights
    cumulative_weights = weights.cumsum()
    # piecewise-linear interpolating function whose domain is
    # the unit interval and whose range is the integers up to n
    f = scipy.interpolate.interp1d(
            numpy.hstack((0.0, cumulative_weights)),
            numpy.arange(n + 1), kind='linear')
    return f(numpy.random.random(size=size)).astype(int)
This is not effective if you want to sample without replacement.
Here's a Go implementation from geodns:
package foo

import (
    "log"
    "math/rand"
)

type server struct {
    Weight int
    data   interface{}
}

func foo(servers []server) []server {
    // servers list is already sorted by the Weight attribute

    // number of items to pick
    max := 4

    result := make([]server, max)

    sum := 0
    for _, r := range servers {
        sum += r.Weight
    }

    for si := 0; si < max; si++ {
        n := rand.Intn(sum + 1)
        s := 0

        for i := range servers {
            s += int(servers[i].Weight)
            if s >= n {
                log.Println("Picked record", i, servers[i])
                sum -= servers[i].Weight
                result[si] = servers[i]

                // remove the server from the list
                servers = append(servers[:i], servers[i+1:]...)
                break
            }
        }
    }

    return result
}
If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights:
import random
def weighted_choose_subset(weighted_set, count):
    """Return a random sample of count elements from a weighted set.

    weighted_set should be a sequence of tuples of the form
    (item, weight), for example: [('a', 1), ('b', 2), ('c', 3)]

    Each element from weighted_set shows up at most once in the
    result, and the relative likelihood of two particular elements
    showing up is equal to the ratio of their weights.

    This works as follows:

    1.) Line up the items along the number line from [0, the sum
    of all weights) such that each item occupies a segment of
    length equal to its weight.

    2.) Randomly pick a number "start" in the range [0, total
    weight / count).

    3.) Find all the points "start + n/count" (for all integers n
    such that the point is within our segments) and yield the set
    containing the items marked by those points.

    Note that this implementation may not return each possible
    subset. For example, with the input ([('a', 1), ('b', 1),
    ('c', 1), ('d', 1)], 2), it may only produce the sets ['a',
    'c'] and ['b', 'd'], but it will do so such that the weights
    are respected.

    This implementation only works for nonnegative integral
    weights. The highest weight in the input set must be less
    than the total weight divided by the count; otherwise it would
    be impossible to respect the weights while never returning
    that element more than once per invocation.
    """
    if count == 0:
        return []

    total_weight = 0
    max_weight = 0
    borders = []
    for item, weight in weighted_set:
        if weight < 0:
            raise RuntimeError("All weights must be positive integers")
        # Scale up weights so dividing total_weight / count doesn't truncate:
        weight *= count
        total_weight += weight
        borders.append(total_weight)
        max_weight = max(max_weight, weight)

    step = int(total_weight / count)
    if max_weight > step:
        raise RuntimeError(
            "Each weight must be less than total weight / count")

    next_stop = random.randint(0, step - 1)
    results = []
    current = 0
    for i in range(count):
        while borders[current] <= next_stop:
            current += 1
        results.append(weighted_set[current][0])
        next_stop += step

    return results
In the question you linked to, Kyle's solution would work with a trivial generalization.
Scan the list and sum the total weights. Then the probability to choose an element should be:
1 - (1 - (#needed/(weight left)))/(weight at n). After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly.
You can check that with everything having weight 1, this simplifies to Kyle's solution.
Edited: (had to rethink what twice as likely meant)
This one does exactly that with O(n) and no excess memory usage. I believe this is a clever and efficient solution easy to port to any language. The first two lines are just to populate sample data in Drupal.
function getNrandomGuysWithWeight($numitems){
  $q = db_query('SELECT id, weight FROM theTableWithTheData');
  $q = $q->fetchAll();

  $accum = 0;
  foreach($q as $r){
    $accum += $r->weight;
    $r->weight = $accum;
  }

  $out = array();

  while(count($out) < $numitems && count($q)){
    $n = rand(0,$accum);
    $lessaccum = NULL;
    $prevaccum = 0;
    $idxrm = 0;
    foreach($q as $i=>$r){
      if(($lessaccum == NULL) && ($n <= $r->weight)){
        $out[] = $r->id;
        $lessaccum = $r->weight - $prevaccum;
        $accum -= $lessaccum;
        $idxrm = $i;
      }else if($lessaccum){
        $r->weight -= $lessaccum;
      }
      $prevaccum = $r->weight;
    }
    unset($q[$idxrm]);
  }
  return $out;
}
I'm putting a simple solution here for picking 1 item; you can easily expand it for k items (Java style):
double random = Math.random();
double sum = 0;
for (int i = 0; i < items.length; i++) {
    val = items[i];
    sum += val.getValue();
    if (sum > random) {
        selected = val;
        break;
    }
}
I have implemented an algorithm similar to Jason Orendorff's idea in Rust here. My version additionally supports bulk operations: insert and remove (when you want to remove a bunch of items given by their ids, not through the weighted selection path) from the data structure in O(m + log n) time where m is the number of items to remove and n the number of items in stored.
Sampling without replacement with recursion - an elegant and very short solution in C#
//how many ways we can choose 4 out of 60 students, so that every time we choose different 4
class Program
{
    static void Main(string[] args)
    {
        int group = 60;
        int studentsToChoose = 4;

        Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
    }

    private static int FindNumberOfStudents(int studentsToChoose, int group)
    {
        if (studentsToChoose == group || studentsToChoose == 0)
            return 1;

        return FindNumberOfStudents(studentsToChoose, group - 1) + FindNumberOfStudents(studentsToChoose - 1, group - 1);
    }
}
I just spent a few hours trying to understand the algorithms for sampling without replacement out there, and this topic is more complex than I initially thought. That's exciting! For the benefit of future readers (have a good day!) I document my insights here, including a ready-to-use function which respects the given inclusion probabilities further below. A nice and quick mathematical overview of the various methods can be found here: Tillé: Algorithms of sampling with equal or unequal probabilities. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as also noted in the document. Actually, the i-th inclusion probabilities can be computed recursively as follows:
def inclusion_probability(i, weights, k):
    """
    Computes the inclusion probability of the i-th element
    in a randomly sampled k-tuple using Jason's algorithm
    (see https://stackoverflow.com/a/2149533/7729124)
    """
    if k <= 0: return 0
    cum_p = 0
    for j, weight in enumerate(weights):
        # compute the probability of j being selected considering the weights
        p = weight / sum(weights)

        if i == j:
            # if this is the target element, we don't have to go deeper,
            # since we know that i is included
            cum_p += p
        else:
            # if this is not the target element, then we compute the conditional
            # inclusion probability of i under the constraint that j is included
            cond_i = i if i < j else i - 1
            cond_weights = weights[:j] + weights[j+1:]
            cond_p = inclusion_probability(cond_i, cond_weights, k - 1)
            cum_p += p * cond_p
    return cum_p
And we can check the validity of the function above by comparing
In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
0 0.41666666666666663
1 0.7333333333333333
2 0.85
to
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})
One way - also suggested in the document above - to specify the inclusion probabilities is to compute the weights from them. The whole complexity of the question at hand stems from the fact that one cannot do that directly since one basically has to invert the recursion formula, symbolically I claim this is impossible. Numerically it can be done using all kind of methods, e.g. Newton's method. However the complexity of inverting the Jacobian using plain Python becomes unbearable quickly, I really recommend looking into numpy.random.choice in this case.
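For completeness, the numpy.random.choice route looks roughly like this (illustrative example; how closely the realized inclusion probabilities match k*w/sum(w) when replace=False is a property of NumPy's internal algorithm):

import numpy as np

weights = np.array([1.0, 2.0, 3.0, 4.0])
p = weights / weights.sum()       # numpy expects probabilities summing to 1
sample = np.random.choice(len(weights), size=2, replace=False, p=p)
print(sample)                     # indices into the weight vector, e.g. [3 1]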
Luckily there is a method using plain Python which might or might not be sufficiently performant for your purposes; it works great if there aren't that many different weights. You can find the algorithm on pages 75 & 76. It works by splitting up the sampling process into parts with the same inclusion probabilities, i.e. we can use random.sample again! I am not going to explain the principle here since the basics are nicely presented on page 69. Here is the code with hopefully a sufficient amount of comments:
def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
    """
    Returns a random sample of k elements from items, where items is a list of
    tuples (weight, element). The inclusion probability of an element in the
    final sample is given by

        k * weight / sum(weights).

    Note that the function raises if a inclusion probability cannot be
    satisfied, e.g the following call is obviously illegal:

        sample_no_replacement_exact([(1,'a'),(2,'b')],2)

    Since selecting two elements means selecting both all the time,
    'b' cannot be selected twice as often as 'a'. In general it can be hard to
    spot if the weights are illegal and the function does *not* always raise
    an exception in that case. To remedy the situation you can pass
    best_effort=True which redistributes the inclusion probability mass
    if necessary. Note that the inclusion probabilities will change
    if deemed necessary.

    The algorithm is based on the splitting procedure on page 75/76 in:
    http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
    Additional information can be found here:
    https://stackoverflow.com/questions/2140787/

    :param items: list of tuples of type weight,element
    :param k: length of resulting sample
    :param best_effort: fix inclusion probabilities if necessary,
                        (optional, defaults to False)
    :param random_: random module to use (optional, defaults to the
                    standard random module)
    :param ε: fuzziness parameter when testing for zero in the context
              of floating point arithmetic (optional, defaults to 1e-9)
    :return: random sample set of size k
    :exception: throws ValueError in case of bad parameters,
                throws AssertionError in case of algorithmic impossibilities
    """
    # random_ defaults to the random submodule
    if not random_:
        random_ = random

    # special case empty return set
    if k <= 0:
        return set()

    if k > len(items):
        raise ValueError("resulting tuple length exceeds number of elements (k > n)")

    # sort items by weight
    items = sorted(items, key=lambda item: item[0])

    # extract the weights and elements
    weights, elements = list(zip(*items))

    # compute the inclusion probabilities (short: π) of the elements
    scaling_factor = k / sum(weights)
    π = [scaling_factor * weight for weight in weights]

    # in case of best_effort: if a inclusion probability exceeds 1,
    # try to rebalance the probabilities such that:
    # a) no probability exceeds 1,
    # b) the probabilities still sum to k, and
    # c) the probability masses flow from top to bottom:
    #    [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
    # (remember that π is sorted)
    if best_effort and π[-1] > 1 + ε:
        # probability mass we still we have to distribute
        debt = 0.
        for i in reversed(range(len(π))):
            if π[i] > 1.:
                # an 'offender', take away excess
                debt += π[i] - 1.
                π[i] = 1.
            else:
                # case π[i] < 1, i.e. 'save' element
                # maximum we can transfer from debt to π[i] and still not
                # exceed 1 is computed by the minimum of:
                # a) 1 - π[i], and
                # b) debt
                max_transfer = min(debt, 1. - π[i])
                debt -= max_transfer
                π[i] += max_transfer
        assert debt < ε, "best effort rebalancing failed (impossible)"

    # make sure we are talking about probabilities
    if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
        raise ValueError("inclusion probabilities not satisfiable: {}" \
                         .format(list(zip(π, elements))))

    # special case equal probabilities
    # (up to fuzziness parameter, remember that π is sorted)
    if π[-1] < π[0] + ε:
        return set(random_.sample(elements, k))

    # compute the two possible lambda values, see formula 7 on page 75
    # (remember that π is sorted)
    λ1 = π[0] * len(π) / k
    λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
    λ = min(λ1, λ2)

    # there are two cases now, see also page 69
    # CASE 1
    # with probability λ we are in the equal probability case
    # where all elements have the same inclusion probability
    if random_.random() < λ:
        return set(random_.sample(elements, k))

    # CASE 2:
    # with probability 1-λ we are in the case of a new sample without
    # replacement problem which is strictly simpler,
    # it has the following new probabilities (see page 75, π^{(2)}):
    new_π = [
        (π_i - λ * k / len(π))
        /
        (1 - λ)
        for π_i in π
    ]
    new_items = list(zip(new_π, elements))

    # the first few probabilities might be 0, remove them
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    while new_items and new_items[0][0] < ε:
        new_items = new_items[1:]

    # the last few probabilities might be 1, remove them and mark them as selected
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    selected_elements = set()
    while new_items and new_items[-1][0] > 1 - ε:
        selected_elements.add(new_items[-1][1])
        new_items = new_items[:-1]

    # the algorithm reduces the length of the sample problem,
    # it is guaranteed that:
    # if λ = λ1: the first item has probability 0
    # if λ = λ2: the last item has probability 1
    assert len(new_items) < len(items), "problem was not simplified (impossible)"

    # recursive call with the simpler sample problem
    # NOTE: we have to make sure that the selected elements are included
    return sample_no_replacement_exact(
        new_items,
        k - len(selected_elements),
        best_effort=best_effort,
        random_=random_,
        ε=ε
    ) | selected_elements
Example:
In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
Out: {'b', 'c'}
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})
The weights sum up to 10, hence the inclusion probabilities compute to: a → 20%, b → 40%, c → 60%, d → 80%. (Sum: 200% = k.) It works!
Just one word of caution for the productive use of this function, it can be very hard to spot illegal inputs for the weights. An obvious illegal example is
In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]
b cannot appear twice as often as a since both always have to be selected. There are more subtle examples. To avoid an exception in production just use best_effort=True, which rebalances the inclusion probability mass so that we always have a valid distribution. Obviously this might change the inclusion probabilities.
I used an associative map (weight, object). For example:
{
  (10, "low"),
  (100, "mid"),
  (10000, "large")
}
total = 10110
Pick a random number between 0 and 'total' and iterate over the keys until this number fits in a given range.
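In Python terms that walk might look like this (illustrative sketch):

import random

weighted = [(10, "low"), (100, "mid"), (10000, "large")]
total = sum(w for w, _ in weighted)          # 10110

def pick_one():
    n = random.uniform(0, total)
    running = 0
    for weight, obj in weighted:
        running += weight
        if n <= running:
            return obj

print(pick_one())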

Checking if two strings are permutations of each other in Python

I'm checking if two strings a and b are permutations of each other, and I'm wondering what the ideal way to do this is in Python. From the Zen of Python, "There should be one -- and preferably only one -- obvious way to do it," but I see there are at least two ways:
sorted(a) == sorted(b)
and
all(a.count(char) == b.count(char) for char in a)
but the first one is slower when (for example) the first char of a is nowhere in b, and the second is slower when they are actually permutations.
Is there any better (either in the sense of more Pythonic, or in the sense of faster on average) way to do it? Or should I just choose from these two depending on which situation I expect to be most common?
Here is a way which is O(n), asymptotically better than the two ways you suggest.
import collections

def same_permutation(a, b):
    d = collections.defaultdict(int)
    for x in a:
        d[x] += 1
    for x in b:
        d[x] -= 1
    return not any(d.itervalues())
## same_permutation([1,2,3],[2,3,1])
#. True
## same_permutation([1,2,3],[2,3,1,1])
#. False
"but the first one is slower when (for example) the first char of a is nowhere in b".
This kind of degenerate-case performance analysis is not a good idea. It's a rat-hole of lost time thinking up all kinds of obscure special cases.
Only do the O-style "overall" analysis.
Overall, the sorts are O(n log n).
The a.count(char) for char in a solution is O(n^2). Each count pass is a full examination of the string.
If some obscure special case happens to be faster -- or slower, that's possibly interesting. But it only matters when you know the frequency of your obscure special cases. When analyzing sort algorithms, it's important to note that a fair number of sorts involve data that's already in the proper order (either by luck or by a clever design), so sort performance on pre-sorted data matters.
In your obscure special case ("the first char of a is nowhere in b") is this frequent enough to matter? If it's just a special case you thought of, set it aside. If it's a fact about your data, then consider it.
Heuristically, you're probably better off splitting them based on string size.
Pseudocode:
returnvalue = false
if len(a) == len(b)
    if len(a) < threshold
        returnvalue = (sorted(a) == sorted(b))
    else
        returnvalue = naminsmethod(a, b)
return returnvalue
If performance is critical, and string size can be large or small then this is what I'd do.
It's pretty common to split things like this based on input size or type. Algorithms have different strengths or weaknesses and it would be foolish to use one where another would be better... In this case Namin's method is O(n), but has a larger constant factor than the O(n log n) sorted method.
I think the first one is the "obvious" way. It is shorter, clearer, and likely to be faster in many cases because Python's built-in sort is highly optimized.
Your second example won't actually work:
all(a.count(char) == b.count(char) for char in a)
will only work if b does not contain extra characters not in a. It also does duplicate work if the characters in string a repeat.
If you want to know whether two strings are permutations of the same unique characters, just do:
set(a) == set(b)
To correct your second example:
all(a.count(char) == b.count(char) for char in set(a) | set(b))
set() objects overload the bitwise OR operator so that it will evaluate to the union of both sets. This will make sure that you will loop over all the characters of both strings once for each character only.
That said, the sorted() method is much simpler and more intuitive, and would be what I would use.
Here are some timed executions on very small strings, using two different methods:
1. sorting
2. counting (specifically the original method by #namin).
a, b, c = 'confused', 'unfocused', 'foncused'
sort_method = lambda x,y: sorted(x) == sorted(y)
def count_method(a, b):
    d = {}
    for x in a:
        d[x] = d.get(x, 0) + 1
    for x in b:
        d[x] = d.get(x, 0) - 1
    for v in d.itervalues():
        if v != 0:
            return False
    return True
Average run times of the 2 methods over 100,000 loops are:
non-match (string a and b)
$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.b)'
100000 loops, best of 3: 9.72 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.b)'
10000 loops, best of 3: 28.1 usec per loop
match (string a and c)
$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a, temp.c)'
100000 loops, best of 3: 9.47 usec per loop
$ python -m timeit -s 'import temp' 'temp.count_method(temp.a, temp.c)'
100000 loops, best of 3: 24.6 usec per loop
Keep in mind that the strings used are very small. The time complexity of the methods are different, so you'll see different results with very large strings. Choose according to your data, you may even use a combination of the two.
Sorry that my code is not in Python, I have never used it, but I am sure this can be easily translated into python. I believe this is faster than all the other examples already posted. It is also O(n), but stops as soon as possible:
public boolean isPermutation(String a, String b) {
    if (a.length() != b.length()) {
        return false;
    }

    int[] charCount = new int[256];
    for (int i = 0; i < a.length(); ++i) {
        ++charCount[a.charAt(i)];
    }

    for (int i = 0; i < b.length(); ++i) {
        if (--charCount[b.charAt(i)] < 0) {
            return false;
        }
    }

    return true;
}
First I don't use a dictionary but an array of size 256 for all the characters. Accessing the index should be much faster. Then when the second string is iterated, I immediately return false when the count gets below 0. When the second loop has finished, you can be sure that the strings are a permutation, because the strings have equal length and no character was used more often in b compared to a.
Here's martinus's code in Python. It only works for ASCII strings:
def is_permutation(a, b):
    if len(a) != len(b):
        return False

    char_count = [0] * 256
    for c in a:
        char_count[ord(c)] += 1

    for c in b:
        char_count[ord(c)] -= 1
        if char_count[ord(c)] < 0:
            return False

    return True
I did a pretty thorough comparison in Java with all words in a book I had. The counting method beats the sorting method in every way. The results:
Testing against 9227 words.
Permutation testing by sorting ... done. 18.582 s
Permutation testing by counting ... done. 14.949 s
If anyone wants the algorithm and test data set, comment away.
First, for sub-problems such as checking whether String 1 and String 2 are exactly the same, you can simply use an "if" comparison.
Second, it is important to consider whether the strings contain only numerical values or whether they can also contain words. If the latter is true (words and numerical values appear in the string at the same time), your first solution will not work. You can enhance it by using the "ord()" function to get the ASCII numerical value. However, in the end you are using sort, so in the worst case your time complexity will be O(N log N). That is not bad, but you can do better: you can make it O(N).
My suggestion is to use an array (list) and a set at the same time. Note that finding a value in an array requires iteration, so its time complexity is O(N), but searching for a value in a set (which I guess is implemented with a hash table in Python, I'm not sure) has O(1) time complexity:
def Permutation2(Str1, Str2):
    ArrStr1 = list(Str1)    # convert Str1 to array
    SetStr2 = set(Str2)     # convert Str2 to set
    ArrExtra = []
    if len(Str1) != len(Str2):  # check their length
        return False
    elif Str1 == Str2:          # check their values
        return True
    for x in xrange(len(ArrStr1)):
        ArrExtra.append(ArrStr1[x])
    for x in xrange(len(ArrExtra)):  # of course len(ArrExtra) == len(ArrStr1) == len(ArrStr2)
        if ArrExtra[x] in SetStr2:   # checking in set is O(1)
            continue
        else:
            return False
    return True
Go with the first one - it's much more straightforward and easier to understand. If you're actually dealing with incredibly large strings and performance is a real issue, then don't use Python, use something like C.
As far as the Zen of Python is concerned, that there should only be one obvious way to do things refers to small, simple things. Obviously for any sufficiently complicated task, there will always be zillions of small variations on ways to do it.
In Python 3.1/2.7 you can just use collections.Counter(a) == collections.Counter(b).
But sorted(a) == sorted(b) is still the most obvious IMHO. You are talking about permutations - changing order - so sorting is the obvious operation to erase that difference.
This is derived from #patros' answer.
from collections import Counter

def is_anagram(a, b, threshold=1000000):
    """Returns true if one sequence is a permutation of the other.

    Ignores whitespace and character case.
    Compares sorted sequences if the length is below the threshold,
    otherwise compares dictionaries that contain the frequency of the
    elements.
    """
    a, b = a.strip().lower(), b.strip().lower()
    length_a, length_b = len(a), len(b)
    if length_a != length_b:
        return False
    if length_a < threshold:
        return sorted(a) == sorted(b)
    return Counter(a) == Counter(b)  # Or use #namin's method if you don't want to create two dictionaries and don't mind the extra typing.
This is an O(n) solution in Python using hashing with dictionaries. Notice that I don't use default dictionaries because the code can stop this way if we determine the two strings are not permutations after checking the second letter for instance.
def if_two_words_are_permutations(s1, s2):
    if len(s1) != len(s2):
        return False
    dic = {}
    for ch in s1:
        if ch in dic.keys():
            dic[ch] += 1
        else:
            dic[ch] = 1
    for ch in s2:
        if not ch in dic.keys():
            return False
        elif dic[ch] == 0:
            return False
        else:
            dic[ch] -= 1
    return True
This is a PHP function I wrote about a week ago which checks if two words are anagrams. How would this compare (if implemented the same in python) to the other methods suggested? Comments?
public function is_anagram($word1, $word2) {
    $letters1 = str_split($word1);
    $letters2 = str_split($word2);
    if (count($letters1) == count($letters2)) {
        foreach ($letters1 as $letter) {
            $index = array_search($letter, $letters2);
            if ($index !== false) {
                unset($letters2[$index]);
            }
            else { return false; }
        }
        return true;
    }
    return false;
}
Here's a literal translation to Python of the PHP version (by JFS):
def is_anagram(word1, word2):
    letters2 = list(word2)
    if len(word1) == len(word2):
        for letter in word1:
            try:
                del letters2[letters2.index(letter)]
            except ValueError:
                return False
        return True
    return False
Comments:
1. The algorithm is O(N**2). Compare it to #namin's version (it is O(N)).
2. The multiple returns in the function look horrible.
This version is faster than any of the examples presented so far, except that it is 20% slower than sorted(x) == sorted(y) for short strings. It depends on use cases, but generally a 20% performance gain is insufficient to justify complicating the code by using a different version for short and long strings (as in #patros's answer).
It doesn't use len, so it accepts any iterable; therefore it works even for data that do not fit in memory, e.g., given two big text files with many repeated lines it answers whether the files have the same lines (lines can be in any order).
def isanagram(iterable1, iterable2):
    d = {}
    get = d.get
    for c in iterable1:
        d[c] = get(c, 0) + 1
    try:
        for c in iterable2:
            d[c] -= 1
        return not any(d.itervalues())
    except KeyError:
        return False
It is unclear why this version is faster than the defaultdict one (#namin's) for a large iterable1 (tested on a 25 MB thesaurus).
If we replace get in the loop with try: ... except KeyError, it performs 2 times slower for short strings, i.e. when there are few duplicates.
In Swift (or another language's implementation), you could look at the encoded values (in this case Unicode) and see if they match.
Something like:
let string1EncodedValues = "Hello".unicodeScalars.map() {
    //each encoded value
    $0
    //Now add the values
}.reduce(0) { total, value in
    total + value.value
}

let string2EncodedValues = "oellH".unicodeScalars.map() {
    $0
}.reduce(0) { total, value in
    total + value.value
}

let equalStrings = string1EncodedValues == string2EncodedValues ? true : false
You will need to handle spaces and cases as needed.
def matchPermutation(s1, s2):
    a = []
    b = []

    if len(s1) != len(s2):
        print 'length should be the same'
        return

    for i in range(len(s1)):
        a.append(s1[i])

    for i in range(len(s2)):
        b.append(s2[i])

    if set(a) == set(b):
        print 'Permutation of each other'
    else:
        print 'Not a permutation of each other'
    return

#matchPermutation('rav', 'var')   # prints 'Permutation of each other'
matchPermutation('rav', 'abc')    # prints 'Not a permutation of each other'
Checking if two strings are permutations of each other in Python
# First method
def permutation(s1, s2):
    if len(s1) != len(s2): return False
    return ' '.join(sorted(s1)) == ' '.join(sorted(s2))

# Second method
def permutation1(s1, s2):
    if len(s1) != len(s2): return False
    array = [0] * 128
    for c in s1:
        array[ord(c)] += 1
    for c in s2:
        array[ord(c)] -= 1
        if array[ord(c)] < 0:
            return False
    return True
How about something like this? It is pretty straightforward and readable. This is for strings, as per the OP. Note that the complexity of sorted() is O(n log n).
def checkPermutation(a, b):
    # input: strings a and b
    # return: boolean true if a is Permutation of b
    if len(a) != len(b):
        return False
    else:
        s_a = ''.join(sorted(a))
        s_b = ''.join(sorted(b))
        if s_a == s_b:
            return True
        else:
            return False
# test inputs
a = 'sRF7w0qbGp4fdgEyNlscUFyouETaPHAiQ2WIxzohiafEGJLw03N8ALvqMw6reLN1kHRjDeDausQBEuIWkIBfqUtsaZcPGoqAIkLlugTxjxLhkRvq5d6i55l4oBH1QoaMXHIZC5nA0K5KPBD9uIwa789sP0ZKV4X6'
b = 'Vq3EeiLGfsAOH2PW6skMN8mEmUAtUKRDIY1kow9t1vIEhe81318wSMICGwf7Rv2qrLrpbeh8bh4hlRLZXDSMyZJYWfejLND4u9EhnNI51DXcQKrceKl9arWqOl7sWIw3EBkeu7Fw4TmyfYwPqCf6oUR0UIdsAVNwbyyiajtQHKh2EKLM1KlY6NdvQTTA7JKn6bLInhFvwZ4yKKbzkgGhF3Oogtnmzl29fW6Q2p0GPuFoueZ74aqlveGTYc0zcXUJkMzltzohoRdMUKP4r5XhbsGBED8ReDbL3ouPhsFchERvvNuaIWLUCY4gl8OW06SMuvceZrCg7EkSFxxprYurHz7VQ2muxzQHj7RG2k3khxbz2ZAhWIlBBtPtg4oXIQ7cbcwgmBXaTXSBgBe3Y8ywYBjinjEjRJjVAiZkWoPrt8JtZv249XiN0MTVYj0ZW6zmcvjZtRn32U3KLMOdjLnRFUP2I3HJtp99uVlM9ghIpae0EfC0v2g78LkZE1YAKsuqCiiy7DVOhyAZUbOrRwXOEDHxUyXwCmo1zfVkPVhwysx8HhH7Iy0yHAMr0Tb97BqcpmmyBsrSgsV1aT3sjY0ctDLibrxbRXBAOexncqB4BBKWJoWkQwUZkFfPXemZkWYmE72w5CFlI6kuwBQp27dCDZ39kRG7Txs1MbsUnRnNHBy1hSOZvTQRYZPX0VmU8SVGUqzwm1ECBHZakQK4RUquk3txKCqbDfbrNmnsEcjFaiMFWkY3Esg6p3Mm41KWysTpzN6287iXjgGSBw6CBv0hH635WiZ0u47IpUD5mY9rkraDDl5sDgd3f586EWJdKAaou3jR7eYU7YuJT3RQVRI0MuS0ec0xYID3WTUI0ckImz2ck7lrtfnkewzRMZSE2ANBkEmg2XAmwrCv0gy4ExW5DayGRXoqUv06ZLGCcBEiaF0fRMlinhElZTVrGPqqhT03WSq4P97JbXA90zUxiHCnsPjuRTthYl7ZaiVZwNt3RtYT4Ff1VQ5KXRwRzdzkRMsubBX7YEhhtl0ZGVlYiP4N4t00Jr7fB4687eabUqK6jcUVpXEpTvKDbj0JLcLYsneM9fsievUz193f6aMQ5o5fm4Ilx3TUZiX4AUsoyd8CD2SK3NkiLuR255BDIA0Zbgnj2XLyQPiJ1T4fjStpjxKOTzsQsZxpThY9Fvjvoxcs3HAiXjLtZ0TSOX6n4ZLjV3TdJMc4PonwqIb3lAndlTMnuzEPof2dXnpexoVm5c37XQ7fBkoMBJ4ydnW25XKYJbkrueRDSwtJGHjY37dob4jPg0axM5uWbqGocXQ4DyiVm5GhvuYX32RQaOtXXXw8cWK6JcSUnlP1gGLMNZEGeDXOuGWiy4AJ7SH93ZQ4iPgoxdfCuW0qbsLKT2HopcY9dtBIRzr91wnES9lDL49tpuW77LSt5dGA0YLSeWAaZt9bDrduE0gDZQ2yX4SDvAOn4PMcbFRfTqzdZXONmO7ruBHHb1tVFlBFNc4xkoetDO2s7mpiVG6YR4EYMFIG1hBPh7Evhttb34AQzqImSQm1gyL3O7n3p98Kqb9qqIPbN1kuhtW5mIbIioWW2n7MHY7E5mt0'
print(checkPermutation(a, b)) #optional
def permute(str1, str2):
    if sorted(str1) == sorted(str2):
        return True
    else:
        return False

str1 = "hello"
str2 = 'olehl'
a = permute(str1, str2)
print(a)
from collections import defaultdict

def permutation(s1, s2):
    h = defaultdict(int)
    for ch in s1:
        h[ch] += 1
    for ch in s2:
        h[ch] -= 1
    for key in h.keys():
        if h[key] != 0 or len(s1) != len(s2):
            return False
    return True

print(permutation("tictac", "tactic"))

Resources