Is there any way to find arithmetic mean "better" than sum()/N?

Is there any way to find arithmetic mean "better" than sum()/N? - algorithm

Suppose we have N numbers(integers, floats, whatever you want) and want to find their arithmetic mean. Simplest method is to sum all values and divide by number of values:
def simple_mean(array[N]): # pseudocode
sum = 0
for i = 1 to N
sum += array[i]
return sum / N
It works fine, but requires big integers.
If we don't want big integers and we are fine with rounding errors, and N is the power of two, we can use 'divide-and-conquer' : ((a+b)/2 + (c+d)/2)/2 = (a+b+c+d)/4, ((a+b+c+d)/4 + (e+f+g+h)/4)/2 = (a+b+c+d+e+f+g+h)/8, so on.
def bisection_average(array[N]):
if N == 1: return array[1]
return (bisection_average(array[:N/2])+bisection_average(array[N/2:]))/2
Any other ways?
PS. playground for lazy

Knuth lists the following method for calculating mean and standard deviation given floating point (original on p. 232 of Vol 2 of The Art of Computer Programming, 1998 edition; my adaptation below avoids special-casing the first iteration):
double M=0, S=0;
for (int i = 0; i < N; ++i)
{
double Mprev = M;
M += (x[i] - M)/(i+1);
S += (x[i] - M)*(x[i] - Mprev);
}
// mean = M
// std dev = sqrt(S/N) or sqrt(S/N+1)
// depending on whether you want population or sample std dev

Here's a way to calculate the mean using just integers without rounding errors and avoiding big intermediate values:
sum = 0
rest = 0
for num in numbers:
sum += num / N
rest += num % N
sum += rest / N
rest = rest % N
return sum, rest

If the big integers are problem... is it ok
a/N + b/N+.... n/N
I mean you're looking just for other ways or the optimal way?

If the array is floating-point data, even the "simple" algorithm suffers from rounding error. Interestingly, in that case, blocking the computation into sqrt(N) sums of length sqrt(N) actually reduces the error in the average case (even though the same number of floating-point roundings are performed).
For integer data, note that you don't need general "big integers"; if you have fewer than 4 billion elements in your array (likely), you only need an integer type 32 bits larger than that the type of the array data. Performing addition on this slightly larger type will pretty much always be faster than doing division or modulus on the type itself. For example, on most 32 bit systems, 64-bit addition is faster than 32-bit division/modulus. This effect will only become more exaggerated as the types become larger.

If you use float you might avoid big integers:
def simple_mean(array[N]):
sum = 0.0 # <---
for i = 1 to N
sum += array[i]
return sum / N

The Kahan algorithm (according to the wikipedia) has better runtime performance (than the pairwise summation) -O(n)- and an O(1) error growth:
function KahanSum(input)
var sum = 0.0
var c = 0.0 // A running compensation for lost low-order bits.
for i = 1 to input.length do
var y = input[i] - c // So far, so good: c is zero.
var t = sum + y // Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t // Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
// Next time around, the lost low part will be added to y in a fresh attempt.
return sum
Its idea is that the low bits of the floating point numbers are summed and corrected independently from the main summation.

Building on Jason S's solution to find a weighted-average and reign back in S's growth.
Using the M finding algorithm given before along with the aggregate formulas for weighted average and population standard deviation:
Avg = Avg(W*X) / Avg(W)
StDev = sqrt(Avg(W*X*X) / Avg(W) - Avg*Avg)
rewrite the code to find the three running averages, then do then the aggregate calculations at the end
function GetPopulationStats{
<#
.SYNOPSIS
Calculate the average, variance, and standard deviation of a weighted data set
.DESCRIPTION
Uses the Knuth method for finding means adapted by Jason S, to calculate the
three averages required by the agregate statistical formulas
.LINK
https://stackoverflow.com/a/1346890/4496560
#>
param(
[decimal[]]$x #Data Points
,[decimal[]]$w #Weights
)
$N = $x.Length
[decimal]$AvgW = 0
[decimal]$AvgWX = 0
[decimal]$AvgWXX = 0
for($i=0; $i -lt $N; $i++){
$AvgW += ($w[$i] - $AvgW) / ($i+1)
$AvgWX += ($w[$i]*$x[$i] - $AvgWX) / ($i+1)
$AvgWXX += ($w[$i]*$x[$i]*$x[$i] - $AvgWXX) / ($i+1)
}
[ordered]#{
N = $N
Avg = ($avg = $AvgWX / $AvgW)
Var = ($var = ($AvgWXX / $AvgW) - ($Avg * $Avg))
StDev = SquareRoot $var
}
}
and then if your language is like Windows PowerShell you'll likely need a higher precision [math]::sqrt() function
function SquareRoot([decimal]$a){
<#
.SYNOPSIS
Find the square-root of $a
.DESCRIPTION
Uses the McDougall-Wotherspoon variant of the Newton-Raphson method to
find the positive zero of:
f(x) = (x * x) - a
f'(x) = 2 * x
.NOTES
ToDo: using a fitted polynomial would likely run faster
#>
$BiCycleX = $PrevX = 0;
$x = $a/2 # guess
$y = ($x * $x) - $a
$xx = $x
$m = $x + $xx
$del = $x - $PrevX
if($del -lt 0){ $del = -$del }
$i = 0
while($del -gt 0 -and $x -ne $BiCycleX){
$BiCycleX = $PrevX;
$PrevX = $x;
$x = $x - ($y / $m)
$y = ($x * $x) - $a
$xx = $x - ($y / $m)
$m = $x + $xx
$del = $x - $PrevX
if($del -lt 0){ $del = -$del }
if(++$i -ge 50){
throw ("invariant sanity fail on step {0}:`r`n x_(n-1) = {1}`r`n x_n = {2}`r`n delta = {3:0.#e0}" -f $i,$PrevX,$x,$del)
}
}
($x + $PrevX) / 2
}
however, if you don't need a weighted solution it should be easy enough to just let w[i]=1 for all i
finally, it doesn't hurt to do a quick sanity check on the code
describe 'tool spot-check' {
context 'testing root calcs' {
$TestCases = #(
#{Value = 0; Expected = 0}
#{Value = 1; Expected = 1}
#{Value = 4; Expected = 2}
#{Value = 9; Expected = 3}
#{Value = 2; Expected = [decimal]'1.4142135623730950488016887242'}
#{Value = (1e14-1); Expected = [decimal]'9999999.99999995'}
)
It 'finds the square root of: <Value>' -TestCases $TestCases {
param($Value,$Expected)
SquareRoot $Value | should be $Expected
}
}
context 'testing stat calcs' {
It 'calculates the values for 1 to 1000' {
$x = 1..1000
$w = #(1) * 1000
$results = GetPopulationStats $x $w
$results.N | should be 1000
$results.Avg | should be 500.5
$results.Var | should be 83333.25
$results.StDev | should be ([decimal]'288.67499025720950043826670416')
}
It 'calculates the values for a test data set' {
$x = #(33,119,37,90,50,94,32,147,86,28,50,80,145,131,121,90,140,170,214,70,124)
$w = #(207,139,25,144,72,162,93,91,109,151,125,87,49,99,210,105,99,169,50,59,22)
$results = GetPopulationStats $x $w
$results.N | should be 21
$results.Avg | should be ([decimal]'94.54433171592412880458756066')
$results.Var | should be ([decimal]'2202.659150711314347179152603')
$results.StDev | should be ([decimal]'46.93249567955356821948311637')
}
}
}

Related

Algorithm for equiprobable random square binary matrices with two non-adjacent non-zeros in each row and column

It would be great if someone could point me towards an algorithm that would allow me to :
create a random square matrix, with entries 0 and 1, such that
every row and every column contain exactly two non-zero entries,
two non-zero entries cannot be adjacent,
all possible matrices are equiprobable.
Right now I manage to achieve points 1 and 2 doing the following : such a matrix can be transformed, using suitable permutations of rows and columns, into a diagonal block matrix with blocks of the form
1 1 0 0 ... 0
0 1 1 0 ... 0
0 0 1 1 ... 0
.............
1 0 0 0 ... 1
So I start from such a matrix using a partition of [0, ..., n-1] and scramble it by permuting rows and columns randomly. Unfortunately, I can't find a way to integrate the adjacency condition, and I am quite sure that my algorithm won't treat all the matrices equally.
Update
I have managed to achieve point 3. The answer was actually straight under my nose : the block matrix I am creating contains all the information needed to take into account the adjacency condition. First some properties and definitions:
a suitable matrix defines permutations of [1, ..., n] that can be build like so: select a 1 in row 1. The column containing this entry contains exactly one other entry equal to 1 on a row a different from 1. Again, row a contains another entry 1 in a column which contains a second entry 1 on a row b, and so on. This starts a permutation 1 -> a -> b ....
For instance, with the following matrix, starting with the marked entry
v
1 0 1 0 0 0 | 1
0 1 0 0 0 1 | 2
1 0 0 1 0 0 | 3
0 0 1 0 1 0 | 4
0 0 0 1 0 1 | 5
0 1 0 0 1 0 | 6
------------+--
1 2 3 4 5 6 |
we get permutation 1 -> 3 -> 5 -> 2 -> 6 -> 4 -> 1.
the cycles of such a permutation lead to the block matrix I mentioned earlier. I also mentioned scrambling the block matrix using arbitrary permutations on the rows and columns to rebuild a matrix compatible with the requirements.
But I was using any permutation, which led to some adjacent non-zero entries. To avoid that, I have to choose permutations that separate rows (and columns) that are adjacent in the block matrix. Actually, to be more precise, if two rows belong to a same block and are cyclically consecutive (the first and last rows of a block are considered consecutive too), then the permutation I want to apply has to move these rows into non-consecutive rows of the final matrix (I will call two rows incompatible in that case).
So the question becomes : How to build all such permutations ?
The simplest idea is to build a permutation progressively by randomly adding rows that are compatible with the previous one. As an example, consider the case n = 6 using partition 6 = 3 + 3 and the corresponding block matrix
1 1 0 0 0 0 | 1
0 1 1 0 0 0 | 2
1 0 1 0 0 0 | 3
0 0 0 1 1 0 | 4
0 0 0 0 1 1 | 5
0 0 0 1 0 1 | 6
------------+--
1 2 3 4 5 6 |
Here rows 1, 2 and 3 are mutually incompatible, as are 4, 5 and 6. Choose a random row, say 3.
We will write a permutation as an array: [2, 5, 6, 4, 3, 1] meaning 1 -> 2, 2 -> 5, 3 -> 6, ... This means that row 2 of the block matrix will become the first row of the final matrix, row 5 will become the second row, and so on.
Now let's build a suitable permutation by choosing randomly a row, say 3:
p = [3, ...]
The next row will then be chosen randomly among the remaining rows that are compatible with 3 : 4, 5and 6. Say we choose 4:
p = [3, 4, ...]
Next choice has to be made among 1 and 2, for instance 1:
p = [3, 4, 1, ...]
And so on: p = [3, 4, 1, 5, 2, 6].
Applying this permutation to the block matrix, we get:
1 0 1 0 0 0 | 3
0 0 0 1 1 0 | 4
1 1 0 0 0 0 | 1
0 0 0 0 1 1 | 5
0 1 1 0 0 0 | 2
0 0 0 1 0 1 | 6
------------+--
1 2 3 4 5 6 |
Doing so, we manage to vertically isolate all non-zero entries. Same has to be done with the columns, for instance by using permutation p' = [6, 3, 5, 1, 4, 2] to finally get
0 1 0 1 0 0 | 3
0 0 1 0 1 0 | 4
0 0 0 1 0 1 | 1
1 0 1 0 0 0 | 5
0 1 0 0 0 1 | 2
1 0 0 0 1 0 | 6
------------+--
6 3 5 1 4 2 |
So this seems to work quite efficiently, but building these permutations needs to be done with caution, because one can easily be stuck: for instance, with n=6 and partition 6 = 2 + 2 + 2, following the construction rules set up earlier can lead to p = [1, 3, 2, 4, ...]. Unfortunately, 5 and 6 are incompatible, so choosing one or the other makes the last choice impossible. I think I've found all situations that lead to a dead end. I will denote by r the set of remaining choices:
p = [..., x, ?], r = {y} with x and y incompatible
p = [..., x, ?, ?], r = {y, z} with y and z being both incompatible with x (no choice can be made)
p = [..., ?, ?], r = {x, y} with x and y incompatible (any choice would lead to situation 1)
p = [..., ?, ?, ?], r = {x, y, z} with x, y and z being cyclically consecutive (choosing x or z would lead to situation 2, choosing y to situation 3)
p = [..., w, ?, ?, ?], r = {x, y, z} with xwy being a 3-cycle (neither x nor y can be chosen, choosing z would lead to situation 3)
p = [..., ?, ?, ?, ?], r = {w, x, y, z} with wxyz being a 4-cycle (any choice would lead to situation 4)
p = [..., ?, ?, ?, ?], r = {w, x, y, z} with xyz being a 3-cycle (choosing w would lead to situation 4, choosing any other would lead to situation 4)
Now it seems that the following algorithm gives all suitable permutations:
As long as there are strictly more than 5 numbers to choose, choose randomly among the compatible ones.
If there are 5 numbers left to choose: if the remaining numbers contain a 3-cycle or a 4-cycle, break that cycle (i.e. choose a number belonging to that cycle).
If there are 4 numbers left to choose: if the remaining numbers contain three cyclically consecutive numbers, choose one of them.
If there are 3 numbers left to choose: if the remaining numbers contain two cyclically consecutive numbers, choose one of them.
I am quite sure that this allows me to generate all suitable permutations and, hence, all suitable matrices.
Unfortunately, every matrix will be obtained several times, depending on the partition that was chosen.

Intro
Here is some prototype-approach, trying to solve the more general task of
uniform combinatorial sampling, which for our approach here means: we can use this approach for everything which we can formulate as SAT-problem.
It's not exploiting your problem directly and takes a heavy detour. This detour to the SAT-problem can help in regards to theory (more powerful general theoretical results) and efficiency (SAT-solvers).
That being said, it's not an approach if you want to sample within seconds or less (in my experiments), at least while being concerned about uniformity.
Theory
The approach, based on results from complexity-theory, follows this work:
GOMES, Carla P.; SABHARWAL, Ashish; SELMAN, Bart. Near-uniform sampling of combinatorial spaces using XOR constraints. In: Advances In Neural Information Processing Systems. 2007. S. 481-488.
The basic idea:
formulate the problem as SAT-problem
add randomly generated xors to the problem (acting on the decision-variables only! that's important in practice)
this will reduce the number of solutions (some solutions will get impossible)
do that in a loop (with tuned parameters) until only one solution is left!
search for some solution is being done by SAT-solvers or #SAT-solvers (=model-counting)
if there is more than one solution: no xors will be added but a complete restart will be done: add random-xors to the start-problem!
The guarantees:
when tuning the parameters right, this approach achieves near-uniform sampling
this tuning can be costly, as it's based on approximating the number of possible solutions
empirically this can also be costly!
Ante's answer, mentioning the number sequence A001499 actually gives a nice upper bound on the solution-space (as it's just ignoring adjacency-constraints!)
The drawbacks:
inefficient for large problems (in general; not necessarily compared to the alternatives like MCMC and co.)
need to change / reduce parameters to produce samples
those reduced parameters lose the theoretical guarantees
but empirically: good results are still possible!
Parameters:
In practice, the parameters are:
N: number of xors added
L: minimum number of variables part of one xor-constraint
U: maximum number of variables part of one xor-constraint
N is important to reduce the number of possible solutions. Given N constant, the other variables of course also have some effect on that.
Theory says (if i interpret correctly), that we should use L = R = 0.5 * #dec-vars.
This is impossible in practice here, as xor-constraints hurt SAT-solvers a lot!
Here some more scientific slides about the impact of L and U.
They call xors of size 8-20 short-XORS, while we will need to use even shorter ones later!
Implementation
Final version
Here is a pretty hacky implementation in python, using the XorSample scripts from here.
The underlying SAT-solver in use is Cryptominisat.
The code basically boils down to:
Transform the problem to conjunctive normal-form
as DIMACS-CNF
Implement the sampling-approach:
Calls XorSample (pipe-based + file-based)
Call SAT-solver (file-based)
Add samples to some file for later analysis
Code: (i hope i did warn you already about the code-quality)
from itertools import count
from time import time
import subprocess
import numpy as np
import os
import shelve
import uuid
import pickle
from random import SystemRandom
cryptogen = SystemRandom()
""" Helper functions """
# K-ARY CONSTRAINT GENERATION
# ###########################
# SINZ, Carsten. Towards an optimal CNF encoding of boolean cardinality constraints.
# CP, 2005, 3709. Jg., S. 827-831.
def next_var_index(start):
next_var = start
while(True):
yield next_var
next_var += 1
class s_index():
def __init__(self, start_index):
self.firstEnvVar = start_index
def next(self,i,j,k):
return self.firstEnvVar + i*k +j
def gen_seq_circuit(k, input_indices, next_var_index_gen):
cnf_string = ''
s_index_gen = s_index(next_var_index_gen.next())
# write clauses of first partial sum (i.e. i=0)
cnf_string += (str(-input_indices[0]) + ' ' + str(s_index_gen.next(0,0,k)) + ' 0\n')
for i in range(1, k):
cnf_string += (str(-s_index_gen.next(0, i, k)) + ' 0\n')
# write clauses for general case (i.e. 0 < i < n-1)
for i in range(1, len(input_indices)-1):
cnf_string += (str(-input_indices[i]) + ' ' + str(s_index_gen.next(i, 0, k)) + ' 0\n')
cnf_string += (str(-s_index_gen.next(i-1, 0, k)) + ' ' + str(s_index_gen.next(i, 0, k)) + ' 0\n')
for u in range(1, k):
cnf_string += (str(-input_indices[i]) + ' ' + str(-s_index_gen.next(i-1, u-1, k)) + ' ' + str(s_index_gen.next(i, u, k)) + ' 0\n')
cnf_string += (str(-s_index_gen.next(i-1, u, k)) + ' ' + str(s_index_gen.next(i, u, k)) + ' 0\n')
cnf_string += (str(-input_indices[i]) + ' ' + str(-s_index_gen.next(i-1, k-1, k)) + ' 0\n')
# last clause for last variable
cnf_string += (str(-input_indices[-1]) + ' ' + str(-s_index_gen.next(len(input_indices)-2, k-1, k)) + ' 0\n')
return (cnf_string, (len(input_indices)-1)*k, 2*len(input_indices)*k + len(input_indices) - 3*k - 1)
# K=2 clause GENERATION
# #####################
def gen_at_most_2_constraints(vars, start_var):
constraint_string = ''
used_clauses = 0
used_vars = 0
index_gen = next_var_index(start_var)
circuit = gen_seq_circuit(2, vars, index_gen)
constraint_string += circuit[0]
used_clauses += circuit[2]
used_vars += circuit[1]
start_var += circuit[1]
return [constraint_string, used_clauses, used_vars, start_var]
def gen_at_least_2_constraints(vars, start_var):
k = len(vars) - 2
vars = [-var for var in vars]
constraint_string = ''
used_clauses = 0
used_vars = 0
index_gen = next_var_index(start_var)
circuit = gen_seq_circuit(k, vars, index_gen)
constraint_string += circuit[0]
used_clauses += circuit[2]
used_vars += circuit[1]
start_var += circuit[1]
return [constraint_string, used_clauses, used_vars, start_var]
# Adjacency conflicts
# ###################
def get_all_adjacency_conflicts_4_neighborhood(N, X):
conflicts = set()
for x in range(N):
for y in range(N):
if x < (N-1):
conflicts.add(((x,y),(x+1,y)))
if y < (N-1):
conflicts.add(((x,y),(x,y+1)))
cnf = '' # slow string appends
for (var_a, var_b) in conflicts:
var_a_ = X[var_a]
var_b_ = X[var_b]
cnf += '-' + var_a_ + ' ' + '-' + var_b_ + ' 0 \n'
return cnf, len(conflicts)
# Build SAT-CNF
#############
def build_cnf(N, verbose=False):
var_counter = count(1)
N_CLAUSES = 0
X = np.zeros((N, N), dtype=object)
for a in range(N):
for b in range(N):
X[a,b] = str(next(var_counter))
# Adjacency constraints
CNF, N_CLAUSES = get_all_adjacency_conflicts_4_neighborhood(N, X)
# k=2 constraints
NEXT_VAR = N*N+1
for row in range(N):
constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_most_2_constraints(X[row, :].astype(int).tolist(), NEXT_VAR)
N_CLAUSES += used_clauses
CNF += constraint_string
constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_least_2_constraints(X[row, :].astype(int).tolist(), NEXT_VAR)
N_CLAUSES += used_clauses
CNF += constraint_string
for col in range(N):
constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_most_2_constraints(X[:, col].astype(int).tolist(), NEXT_VAR)
N_CLAUSES += used_clauses
CNF += constraint_string
constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_least_2_constraints(X[:, col].astype(int).tolist(), NEXT_VAR)
N_CLAUSES += used_clauses
CNF += constraint_string
# build final cnf
CNF = 'p cnf ' + str(NEXT_VAR-1) + ' ' + str(N_CLAUSES) + '\n' + CNF
return X, CNF, NEXT_VAR-1
# External tools
# ##############
def get_random_xor_problem(CNF_IN_fp, N_DEC_VARS, N_ALL_VARS, s, min_l, max_l):
# .cnf not part of arg!
p = subprocess.Popen(['./gen-wff', CNF_IN_fp,
str(N_DEC_VARS), str(N_ALL_VARS),
str(s), str(min_l), str(max_l), 'xored'],
stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
result = p.communicate()
os.remove(CNF_IN_fp + '-str-xored.xor') # file not needed
return CNF_IN_fp + '-str-xored.cnf'
def solve(CNF_IN_fp, N_DEC_VARS):
seed = cryptogen.randint(0, 2147483647) # actually no reason to do it; but can't hurt either
p = subprocess.Popen(["./cryptominisat5", '-t', '4', '-r', str(seed), CNF_IN_fp], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
result = p.communicate()[0]
sat_line = result.find('s SATISFIABLE')
if sat_line != -1:
# solution found!
vars = parse_solution(result)[:N_DEC_VARS]
# forbid solution (DeMorgan)
negated_vars = list(map(lambda x: x*(-1), vars))
with open(CNF_IN_fp, 'a') as f:
f.write( (str(negated_vars)[1:-1] + ' 0\n').replace(',', ''))
# assume solve is treating last constraint despite not changing header!
# solve again
seed = cryptogen.randint(0, 2147483647)
p = subprocess.Popen(["./cryptominisat5", '-t', '4', '-r', str(seed), CNF_IN_fp], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
result = p.communicate()[0]
sat_line = result.find('s SATISFIABLE')
if sat_line != -1:
os.remove(CNF_IN_fp) # not needed anymore
return True, False, None
else:
return True, True, vars
else:
return False, False, None
def parse_solution(output):
# assumes there is one
vars = []
for line in output.split("\n"):
if line:
if line[0] == 'v':
line_vars = list(map(lambda x: int(x), line.split()[1:]))
vars.extend(line_vars)
return vars
# Core-algorithm
# ##############
def xorsample(X, CNF_IN_fp, N_DEC_VARS, N_VARS, s, min_l, max_l):
start_time = time()
while True:
# add s random XOR constraints to F
xored_cnf_fp = get_random_xor_problem(CNF_IN_fp, N_DEC_VARS, N_VARS, s, min_l, max_l)
state_lvl1, state_lvl2, var_sol = solve(xored_cnf_fp, N_DEC_VARS)
print('------------')
if state_lvl1 and state_lvl2:
print('FOUND')
d = shelve.open('N_15_70_4_6_TO_PLOT')
d[str(uuid.uuid4())] = (pickle.dumps(var_sol), time() - start_time)
d.close()
return True
else:
if state_lvl1:
print('sol not unique')
else:
print('no sol found')
print('------------')
""" Run """
N = 15
N_DEC_VARS = N*N
X, CNF, N_VARS = build_cnf(N)
with open('my_problem.cnf', 'w') as f:
f.write(CNF)
counter = 0
while True:
print('sample: ', counter)
xorsample(X, 'my_problem', N_DEC_VARS, N_VARS, 70, 4, 6)
counter += 1
Output will look like (removed some warnings):
------------
no sol found
------------
------------
no sol found
------------
------------
no sol found
------------
------------
sol not unique
------------
------------
FOUND
Core: CNF-formulation
We introduce one variable for every cell of the matrix. N=20 means 400 binary-variables.
Adjancency:
Precalculate all symmetry-reduced conflicts and add conflict-clauses.
Basic theory:
a -> !b
<->
!a v !b (propositional logic)
Row/Col-wise Cardinality:
This is tough to express in CNF and naive approaches need an exponential number
of constraints.
We use some adder-circuit based encoding (SINZ, Carsten. Towards an optimal CNF encoding of boolean cardinality constraints) which introduces new auxiliary-variables.
Remark:
sum(var_set) <= k
<->
sum(negated(var_set)) >= len(var_set) - k
These SAT-encodings can be put into exact model-counters (for small N; e.g. < 9). The number of solutions equals Ante's results, which is a strong indication for a correct transformation!
There are also interesting approximate model-counters (also heavily based on xor-constraints) like approxMC which shows one more thing we can do with the SAT-formulation. But in practice i have not been able to use these (approxMC = autoconf; no comment).
Other experiments
I did also build a version using pblib, to use more powerful cardinality-formulations
for the SAT-CNF formulation. I did not try to use the C++-based API, but only the reduced pbencoder, which automatically selects some best encoding, which was way worse than my encoding used above (which is best is still a research-problem; often even redundant-constraints can help).
Empirical analysis
For the sake of obtaining some sample-size (given my patience), i only computed samples for N=15. In this case we used:
N=70 xors
L,U = 4,6
I also computed some samples for N=20 with (100,3,6), but this takes a few mins and we reduced the lower bound!
Visualization
Here some animation (strengthening my love-hate relationship with matplotlib):
Edit: And a (reduced) comparison to brute-force uniform-sampling with N=5 (NXOR,L,U = 4, 10, 30):
(I have not yet decided on the addition of the plotting-code. It's as ugly as the above one and people might look too much into my statistical shambles; normalizations and co.)
Theory
Statistical analysis is probably hard to do as the underlying problem is of such combinatoric nature. It's even not entirely obvious how that final cell-PDF should look like. In the case of N=odd, it's probably non-uniform and looks like a chess-board (i did brute-force check N=5 to observe this).
One thing we can be sure about (imho): symmetry!
Given a cell-PDF matrix, we should expect, that the matrix is symmetric (A = A.T).
This is checked in the visualization and the euclidean-norm of differences over time is plotted.
We can do the same on some other observation: observed pairings.
For N=3, we can observe the following pairs:
0,1
0,2
1,2
Now we can do this per-row and per-column and should expect symmetry too!
Sadly, it's probably not easy to say something about the variance and therefore the needed samples to speak about confidence!
Observation
According to my simplified perception, current-samples and the cell-PDF look good, although convergence is not achieved yet (or we are far away from uniformity).
The more important aspect are probably the two norms, nicely decreasing towards 0.
(yes; one could tune some algorithm for that by transposing with prob=0.5; but this is not done here as it would defeat it's purpose).
Potential next steps
Tune parameters
Check out the approach using #SAT-solvers / Model-counters instead of SAT-solvers
Try different CNF-formulations, especially in regards to cardinality-encodings and xor-encodings
XorSample is by default using tseitin-like encoding to get around exponentially grow
for smaller xors (as used) it might be a good idea to use naive encoding (which propagates faster)
XorSample supports that in theory; but the script's work differently in practice
Cryptominisat is known for dedicated XOR-handling (as it was build for analyzing cryptography including many xors) and might gain something by naive encoding (as inferring xors from blown-up CNFs is much harder)
More statistical-analysis
Get rid of XorSample scripts (shell + perl...)
Summary
The approach is very general
This code produces feasible samples
It should be not hard to prove, that every feasible solution can be sampled
Others have proven theoretical guarantees for uniformity for some params
does not hold for our params
Others have empirically / theoretically analyzed smaller parameters (in use here)

(Updated test results, example run-through and code snippets below.)
You can use dynamic programming to calculate the number of solutions resulting from every state (in a much more efficient way than a brute-force algorithm), and use those (pre-calculated) values to create equiprobable random solutions.
Consider the example of a 7x7 matrix; at the start, the state is:
0,0,0,0,0,0,0
meaning that there are seven adjacent unused columns. After adding two ones to the first row, the state could be e.g.:
0,1,0,0,1,0,0
with two columns that now have a one in them. After adding ones to the second row, the state could be e.g.:
0,1,1,0,1,0,1
After three rows are filled, there is a possibility that a column will have its maximum of two ones; this effectively splits the matrix into two independent zones:
1,1,1,0,2,0,1 -> 1,1,1,0 + 0,1
These zones are independent in the sense that the no-adjacent-ones rule has no effect when adding ones to different zones, and the order of the zones has no effect on the number of solutions.
In order to use these states as signatures for types of solutions, we have to transform them into a canonical notation. First, we have to take into account the fact that columns with only 1 one in them may be unusable in the next row, because they contain a one in the current row. So instead of a binary notation, we have to use a ternary notation, e.g.:
2,1,1,0 + 0,1
where the 2 means that this column was used in the current row (and not that there are 2 ones in the column). At the next step, we should then convert the twos back into ones.
Additionally, we can also mirror the seperate groups to put them into their lexicographically smallest notation:
2,1,1,0 + 0,1 -> 0,1,1,2 + 0,1
Lastly, we sort the seperate groups from small to large, and then lexicographically, so that a state in a larger matrix may be e.g.:
0,0 + 0,1 + 0,0,2 + 0,1,0 + 0,1,0,1
Then, when calculating the number of solutions resulting from each state, we can use memoization using the canonical notation of each state as a key.
Creating a dictionary of the states and the number of solutions for each of them only needs to be done once, and a table for larger matrices can probably be used for smaller matrices too.
Practically, you'd generate a random number between 0 and the total number of solutions, and then for every row, you'd look at the different states you could create from the current state, look at the number of unique solutions each one would generate, and see which option leads to the solution that corresponds with your randomly generated number.
Note that every state and the corresponding key can only occur in a particular row, so you can store the keys in seperate dictionaries per row.
TEST RESULTS
A first test using unoptimized JavaScript gave very promising results. With dynamic programming, calculating the number of solutions for a 10x10 matrix now takes a second, where a brute-force algorithm took several hours (and this is the part of the algorithm that only needs to be done once). The size of the dictionary with the signatures and numbers of solutions grows with a diminishing factor approaching 2.5 for each step in size; the time to generate it grows with a factor of around 3.
These are the number of solutions, states, signatures (total size of the dictionaries), and maximum number of signatures per row (largest dictionary per row) that are created:
size unique solutions states signatures max/row
4x4 2 9 6 2
5x5 16 73 26 8
6x6 722 514 107 40
7x7 33,988 2,870 411 152
8x8 2,215,764 13,485 1,411 596
9x9 179,431,924 56,375 4,510 1,983
10x10 17,849,077,140 218,038 13,453 5,672
11x11 2,138,979,146,276 801,266 38,314 14,491
12x12 304,243,884,374,412 2,847,885 104,764 35,803
13x13 50,702,643,217,809,908 9,901,431 278,561 96,414
14x14 9,789,567,606,147,948,364 33,911,578 723,306 238,359
15x15 2,168,538,331,223,656,364,084 114,897,838 1,845,861 548,409
16x16 546,386,962,452,256,865,969,596 ... 4,952,501 1,444,487
17x17 155,420,047,516,794,379,573,558,433 12,837,870 3,754,040
18x18 48,614,566,676,379,251,956,711,945,475 31,452,747 8,992,972
19x19 17,139,174,923,928,277,182,879,888,254,495 74,818,773 20,929,008
20x20 6,688,262,914,418,168,812,086,412,204,858,650 175,678,000 50,094,203
(Additional results obtained with C++, using a simple 128-bit integer implementation. To count the states, the code had to be run using each state as a seperate signature, which I was unable to do for the largest sizes. )
EXAMPLE
The dictionary for a 5x5 matrix looks like this:
row 0: 00000 -> 16 row 3: 101 -> 0
1112 -> 1
row 1: 20002 -> 2 1121 -> 1
00202 -> 4 1+01 -> 0
02002 -> 2 11+12 -> 2
02020 -> 2 1+121 -> 1
0+1+1 -> 0
row 2: 10212 -> 1 1+112 -> 1
12012 -> 1
12021 -> 2 row 4: 0 -> 0
12102 -> 1 11 -> 0
21012 -> 0 12 -> 0
02121 -> 3 1+1 -> 1
01212 -> 1 1+2 -> 0
The total number of solutions is 16; if we randomly pick a number from 0 to 15, e.g. 13, we can find the corresponding (i.e. the 14th) solution like this:
state: 00000
options: 10100 10010 10001 01010 01001 00101
signature: 00202 02002 20002 02020 02002 00202
solutions: 4 2 2 2 2 4
This tells us that the 14th solution is the 2nd solution of option 00101. The next step is:
state: 00101
options: 10010 01010
signature: 12102 02121
solutions: 1 3
This tells us that the 2nd solution is the 1st solution of option 01010. The next step is:
state: 01111
options: 10100 10001 00101
signature: 11+12 1112 1+01
solutions: 2 1 0
This tells us that the 1st solution is the 1st solution of option 10100. The next step is:
state: 11211
options: 01010 01001
signature: 1+1 1+1
solutions: 1 1
This tells us that the 1st solutions is the 1st solution of option 01010. The last step is:
state: 12221
options: 10001
And the 5x5 matrix corresponding to randomly chosen number 13 is:
0 0 1 0 1
0 1 0 1 0
1 0 1 0 0
0 1 0 1 0
1 0 0 0 1
And here's a quick'n'dirty code example; run the snippet to generate the signature and solution count dictionary, and generate a random 10x10 matrix (it takes a second to generate the dictionary; once that is done, it generates random solutions in half a millisecond):
function signature(state, prev) {
var zones = [], zone = [];
for (var i = 0; i < state.length; i++) {
if (state[i] == 2) {
if (zone.length) zones.push(mirror(zone));
zone = [];
}
else if (prev[i]) zone.push(3);
else zone.push(state[i]);
}
if (zone.length) zones.push(mirror(zone));
zones.sort(function(a,b) {return a.length - b.length || a - b;});
return zones.length ? zones.join("2") : "2";
function mirror(zone) {
var ltr = zone.join('');
zone.reverse();
var rtl = zone.join('');
return (ltr < rtl) ? ltr : rtl;
}
}
function memoize(n) {
var memo = [], empty = [];
for (var i = 0; i <= n; i++) memo[i] = [];
for (var i = 0; i < n; i++) empty[i] = 0;
memo[0][signature(empty, empty)] = next_row(empty, empty, 1);
return memo;
function next_row(state, prev, row) {
if (row > n) return 1;
var solutions = 0;
for (var i = 0; i < n - 2; i++) {
if (state[i] == 2 || prev[i] == 1) continue;
for (var j = i + 2; j < n; j++) {
if (state[j] == 2 || prev[j] == 1) continue;
var s = state.slice(), p = empty.slice();
++s[i]; ++s[j]; ++p[i]; ++p[j];
var sig = signature(s, p);
var sol = memo[row][sig];
if (sol == undefined)
memo[row][sig] = sol = next_row(s, p, row + 1);
solutions += sol;
}
}
return solutions;
}
}
function random_matrix(n, memo) {
var matrix = [], empty = [], state = [], prev = [];
for (var i = 0; i < n; i++) empty[i] = state[i] = prev[i] = 0;
var total = memo[0][signature(empty, empty)];
var pick = Math.floor(Math.random() * total);
document.write("solution " + pick.toLocaleString('en-US') +
" from a total of " + total.toLocaleString('en-US') + "<br>");
for (var row = 1; row <= n; row++) {
var options = find_options(state, prev);
for (var i in options) {
var state_copy = state.slice();
for (var j in state_copy) state_copy[j] += options[i][j];
var sig = signature(state_copy, options[i]);
var solutions = memo[row][sig];
if (pick < solutions) {
matrix.push(options[i].slice());
prev = options[i].slice();
state = state_copy.slice();
break;
}
else pick -= solutions;
}
}
return matrix;
function find_options(state, prev) {
var options = [];
for (var i = 0; i < n - 2; i++) {
if (state[i] == 2 || prev[i] == 1) continue;
for (var j = i + 2; j < n; j++) {
if (state[j] == 2 || prev[j] == 1) continue;
var option = empty.slice();
++option[i]; ++option[j];
options.push(option);
}
}
return options;
}
}
var size = 10;
var memo = memoize(size);
var matrix = random_matrix(size, memo);
for (var row in matrix) document.write(matrix[row] + "<br>");
The code snippet below shows the dictionary of signatures and solution counts for a matrix of size 10x10. I've used a slightly different signature format from the explanation above: the zones are delimited by a '2' instead of a plus sign, and a column which has a one in the previous row is marked with a '3' instead of a '2'. This shows how the keys could be stored in a file as integers with 2×N bits (padded with 2's).
function signature(state, prev) {
var zones = [], zone = [];
for (var i = 0; i < state.length; i++) {
if (state[i] == 2) {
if (zone.length) zones.push(mirror(zone));
zone = [];
}
else if (prev[i]) zone.push(3);
else zone.push(state[i]);
}
if (zone.length) zones.push(mirror(zone));
zones.sort(function(a,b) {return a.length - b.length || a - b;});
return zones.length ? zones.join("2") : "2";
function mirror(zone) {
var ltr = zone.join('');
zone.reverse();
var rtl = zone.join('');
return (ltr < rtl) ? ltr : rtl;
}
}
function memoize(n) {
var memo = [], empty = [];
for (var i = 0; i <= n; i++) memo[i] = [];
for (var i = 0; i < n; i++) empty[i] = 0;
memo[0][signature(empty, empty)] = next_row(empty, empty, 1);
return memo;
function next_row(state, prev, row) {
if (row > n) return 1;
var solutions = 0;
for (var i = 0; i < n - 2; i++) {
if (state[i] == 2 || prev[i] == 1) continue;
for (var j = i + 2; j < n; j++) {
if (state[j] == 2 || prev[j] == 1) continue;
var s = state.slice(), p = empty.slice();
++s[i]; ++s[j]; ++p[i]; ++p[j];
var sig = signature(s, p);
var sol = memo[row][sig];
if (sol == undefined)
memo[row][sig] = sol = next_row(s, p, row + 1);
solutions += sol;
}
}
return solutions;
}
}
var memo = memoize(10);
for (var i in memo) {
document.write("row " + i + ":<br>");
for (var j in memo[i]) {
document.write(""" + j + "": " + memo[i][j] + "<br>");
}
}

Just few thoughts. Number of matrices satisfying conditions for n <= 10:
3 0
4 2
5 16
6 722
7 33988
8 2215764
9 179431924
10 17849077140
Unfortunatelly there is no sequence with these numbers in OEIS.
There is one similar (A001499), without condition for neighbouring one's. Number of nxn matrices in this case is 'of order' as A001499's number of (n-1)x(n-1) matrices. That is to be expected since number
of ways to fill one row in this case, position 2 one's in n places with at least one zero between them is ((n-1) choose 2). Same as to position 2 one's in (n-1) places without the restriction.
I don't think there is an easy connection between these matrix of order n and A001499 matrix of order n-1, meaning that if we have A001499 matrix than we can construct some of these matrices.
With this, for n=20, number of matrices is >10^30. Quite a lot :-/

This solution use recursion in order to set the cell of the matrix one by one. If the random walk finish with an impossible solution then we rollback one step in the tree and we continue the random walk.
The algorithm is efficient and i think that the generated data are highly equiprobable.
package rndsqmatrix;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;
public class RndSqMatrix {
/**
* Generate a random matrix
* #param size the size of the matrix
* #return the matrix encoded in 1d array i=(x+y*size)
*/
public static int[] generate(final int size) {
return generate(size, new int[size * size], new int[size],
new int[size]);
}
/**
* Build a matrix recursivly with a random walk
* #param size the size of the matrix
* #param matrix the matrix encoded in 1d array i=(x+y*size)
* #param rowSum
* #param colSum
* #return
*/
private static int[] generate(final int size, final int[] matrix,
final int[] rowSum, final int[] colSum) {
// generate list of valid positions
final List<Integer> positions = new ArrayList();
for (int y = 0; y < size; y++) {
if (rowSum[y] < 2) {
for (int x = 0; x < size; x++) {
if (colSum[x] < 2) {
final int p = x + y * size;
if (matrix[p] == 0
&& (x == 0 || matrix[p - 1] == 0)
&& (x == size - 1 || matrix[p + 1] == 0)
&& (y == 0 || matrix[p - size] == 0)
&& (y == size - 1 || matrix[p + size] == 0)) {
positions.add(p);
}
}
}
}
}
// no valid positions ?
if (positions.isEmpty()) {
// if the matrix is incomplete => return null
for (int i = 0; i < size; i++) {
if (rowSum[i] != 2 || colSum[i] != 2) {
return null;
}
}
// the matrix is complete => return it
return matrix;
}
// random walk
Collections.shuffle(positions);
for (int p : positions) {
// set '1' and continue recursivly the exploration
matrix[p] = 1;
rowSum[p / size]++;
colSum[p % size]++;
final int[] solMatrix = generate(size, matrix, rowSum, colSum);
if (solMatrix != null) {
return solMatrix;
}
// rollback
matrix[p] = 0;
rowSum[p / size]--;
colSum[p % size]--;
}
// we can't find a valid matrix from here => return null
return null;
}
public static void printMatrix(final int size, final int[] matrix) {
for (int y = 0; y < size; y++) {
for (int x = 0; x < size; x++) {
System.out.print(matrix[x + y * size]);
System.out.print(" ");
}
System.out.println();
}
}
public static void printStatistics(final int size, final int count) {
final int sumMatrix[] = new int[size * size];
for (int i = 0; i < count; i++) {
final int[] matrix = generate(size);
for (int j = 0; j < sumMatrix.length; j++) {
sumMatrix[j] += matrix[j];
}
}
printMatrix(size, sumMatrix);
}
public static void checkAlgorithm() {
final int size = 8;
final int count = 2215764;
final int divisor = 122;
final int sumMatrix[] = new int[size * size];
for (int i = 0; i < count/divisor ; i++) {
final int[] matrix = generate(size);
for (int j = 0; j < sumMatrix.length; j++) {
sumMatrix[j] += matrix[j];
}
}
int total = 0;
for(int i=0; i < sumMatrix.length; i++) {
total += sumMatrix[i];
}
final double factor = (double)total / (count/divisor);
System.out.println("Factor=" + factor + " (theory=16.0)");
}
public static void benchmark(final int size, final int count,
final boolean parallel) {
final long begin = System.currentTimeMillis();
if (!parallel) {
for (int i = 0; i < count; i++) {
generate(size);
}
} else {
IntStream.range(0, count).parallel().forEach(i -> generate(size));
}
final long end = System.currentTimeMillis();
System.out.println("rate="
+ (double) (end - begin) / count + "ms/matrix");
}
public static void main(String[] args) {
checkAlgorithm();
benchmark(8, 10000, true);
//printStatistics(8, 2215764/36);
printStatistics(8, 2215764);
}
}
The output is:
Factor=16.0 (theory=16.0)
rate=0.2835ms/matrix
552969 554643 552895 554632 555680 552753 554567 553389
554071 554847 553441 553315 553425 553883 554485 554061
554272 552633 555130 553699 553604 554298 553864 554028
554118 554299 553565 552986 553786 554473 553530 554771
554474 553604 554473 554231 553617 553556 553581 553992
554960 554572 552861 552732 553782 554039 553921 554661
553578 553253 555721 554235 554107 553676 553776 553182
553086 553677 553442 555698 553527 554850 553804 553444

Here is a very fast approach of generating the matrix row by row, written in Java:
public static void main(String[] args) throws Exception {
int n = 100;
Random rnd = new Random();
byte[] mat = new byte[n*n];
byte[] colCount = new byte[n];
//generate row by row
for (int x = 0; x < n; x++) {
//generate a random first bit
int b1 = rnd.nextInt(n);
while ( (x > 0 && mat[(x-1)*n + b1] == 1) || //not adjacent to the one above
(colCount[b1] == 2) //not in a column which has 2
) b1 = rnd.nextInt(n);
//generate a second bit, not equal to the first one
int b2 = rnd.nextInt(n);
while ( (b2 == b1) || //not the same as bit 1
(x > 0 && mat[(x-1)*n + b2] == 1) || //not adjacent to the one above
(colCount[b2] == 2) || //not in a column which has 2
(b2 == b1 - 1) || //not adjacent to b1
(b2 == b1 + 1)
) b2 = rnd.nextInt(n);
//fill the matrix values and increment column counts
mat[x*n + b1] = 1;
mat[x*n + b2] = 1;
colCount[b1]++;
colCount[b2]++;
}
String arr = Arrays.toString(mat).substring(1, n*n*3 - 1);
System.out.println(arr.replaceAll("(.{" + n*3 + "})", "$1\n"));
}
It essentially generates each a random row at a time. If the row will violate any of the conditions, it is generated again (again randomly). I believe this will satisfy condition 4 as well.
Adding a quick note that it will spin forever for N-s where there is no solutions (like N=3).

Can this operation be vectorized in Octave?

I have a matrix A and B. I want to take the sum of squares errors between them ss = sum(sum( (A-B).^2 )), but I only want to do so if NEITHER matrix elements are identically zero. For now, I am going through each matrix as follows:
for i = 1:N
for j = 1:M
if( A(i,j) == 0 )
B(i,j) = 0;
elseif( B(i,j) == 0 )
A(i,j) = 0;
end
end
end
and then taking the sum of squares after that. Is there a way to vectorize the comparison and reassigning of values?

If you were just trying to achieve what the listed code is doing, but in a vectorized fashion, you can use this approach -
%// Create mask to set elements in both A and B to zeros
mask = A==0 | B==0
%// Set A and B to zeros at places where mask has TRUE values
A(mask) = 0
B(mask) = 0
If the bigger context of finding the sum of squares errors after the listed code could be considered, you can do so with this -
df = A - B;
df(A==0 | B==0) = 0;
ss_vectorized = sum(df(:).^2);
Or as #carandraug commented, you can use the built-in sumsq for the sum of squares calculation at the last step -
ss_vectorized = sumsq(df(:));

How can I find the center of a cluster of data points?

Let's say I plotted the position of a helicopter every day for the past year and came up with the following map:
Any human looking at this would be able to tell me that this helicopter is based out of Chicago.
How can I find the same result in code?
I'm looking for something like this:
$geoCodeArray = array([GET=http://pastebin.com/grVsbgL9]);
function findHome($geoCodeArray) {
// magic
return $geoCode;
}
Ultimately generating something like this:
UPDATE: Sample Dataset
Here's a map with a sample dataset: http://batchgeo.com/map/c3676fe29985f00e1605cd4f86920179
Here's a pastebin of 150 geocodes: http://pastebin.com/grVsbgL9
The above contains 150 geocodes. The first 50 are in a few clusters close to Chicago. The remaining are scattered throughout the country, including some small clusters in New York, Los Angeles, and San Francisco.
I have about a million (seriously) datasets like this that I'll need to iterate through and identify the most likely "home". Your help is greatly appreciated.
UPDATE 2: Airplane switched to Helicopter
The airplane concept was drawing too much attention toward physical airports. The coordinates can be anywhere in the world, not just airports. Let's assume it's a super helicopter not bound by physics, fuel, or anything else. It can land where it wants. ;)

The following solution works even if the points are scattered all over the Earth, by converting latitude and longitude to Cartesian coordinates. It does a kind of KDE (kernel density estimation), but in a first pass the sum of kernels is evaluated only at the data points. The kernel should be chosen to fit the problem. In the code below it is what I could jokingly/presumptuously call a Trossian, i.e., 2-d²/h² for d≤h and h²/d² for d>h (where d is the Euclidean distance and h is the "bandwidth" $global_kernel_radius), but it could also be a Gaussian (e-d²/2h²), an Epanechnikov kernel (1-d²/h² for d<h, 0 otherwise), or another kernel. An optional second pass refines the search locally, either by summing an independent kernel on a local grid, or by calculating the centroid, in both cases in a surrounding defined by $local_grid_radius.
In essence, each point sums all the points it has around (including itself), weighing them more if they are closer (by the bell curve), and also weighing them by the optional weight array $w_arr. The winner is the point with the maximum sum. Once the winner has been found, the "home" we are looking for can be found by repeating the same process locally around the winner (using another bell curve), or it can be estimated to be the "center of mass" of all points within a given radius from the winner, where the radius can be zero.
The algorithm must be adapted to the problem by choosing the appropriate kernels, by choosing how to refine the search locally, and by tuning the parameters. For the example dataset, the Trossian kernel for the first pass and the Epanechnikov kernel for the second pass, with all 3 radii set to 30 mi and a grid step of 1 mi could be a good starting point, but only if the two sub-clusters of Chicago should be seen as one big cluster. Otherwise smaller radii must be chosen.
function find_home($lat_arr, $lng_arr, $global_kernel_radius,
$local_kernel_radius,
$local_grid_radius, // 0 for no 2nd pass
$local_grid_step, // 0 for centroid
$units='mi',
$w_arr=null)
{
// for lat,lng <-> x,y,z see http://en.wikipedia.org/wiki/Geodetic_datum
// for K and h see http://en.wikipedia.org/wiki/Kernel_density_estimation
switch (strtolower($units)) {
/* */case 'nm' :
/*or*/case 'nmi': $m_divisor = 1852;
break;case 'mi': $m_divisor = 1609.344;
break;case 'km': $m_divisor = 1000;
break;case 'm': $m_divisor = 1;
break;default: return false;
}
$a = 6378137 / $m_divisor; // Earth semi-major axis (WGS84)
$e2 = 6.69437999014E-3; // First eccentricity squared (WGS84)
$lat_lng_count = count($lat_arr);
if ( !$w_arr) {
$w_arr = array_fill(0, $lat_lng_count, 1.0);
}
$x_arr = array();
$y_arr = array();
$z_arr = array();
$rad = M_PI / 180;
$one_e2 = 1 - $e2;
for ($i = 0; $i < $lat_lng_count; $i++) {
$lat = $lat_arr[$i];
$lng = $lng_arr[$i];
$sin_lat = sin($lat * $rad);
$sin_lng = sin($lng * $rad);
$cos_lat = cos($lat * $rad);
$cos_lng = cos($lng * $rad);
// height = 0 (!)
$N = $a / sqrt(1 - $e2 * $sin_lat * $sin_lat);
$x_arr[$i] = $N * $cos_lat * $cos_lng;
$y_arr[$i] = $N * $cos_lat * $sin_lng;
$z_arr[$i] = $N * $one_e2 * $sin_lat;
}
$h = $global_kernel_radius;
$h2 = $h * $h;
$max_K_sum = -1;
$max_K_sum_idx = -1;
for ($i = 0; $i < $lat_lng_count; $i++) {
$xi = $x_arr[$i];
$yi = $y_arr[$i];
$zi = $z_arr[$i];
$K_sum = 0;
for ($j = 0; $j < $lat_lng_count; $j++) {
$dx = $xi - $x_arr[$j];
$dy = $yi - $y_arr[$j];
$dz = $zi - $z_arr[$j];
$d2 = $dx * $dx + $dy * $dy + $dz * $dz;
$K_sum += $w_arr[$j] * ($d2 <= $h2 ? (2 - $d2 / $h2) : $h2 / $d2); // Trossian ;-)
// $K_sum += $w_arr[$j] * exp(-0.5 * $d2 / $h2); // Gaussian
}
if ($max_K_sum < $K_sum) {
$max_K_sum = $K_sum;
$max_K_sum_i = $i;
}
}
$winner_x = $x_arr [$max_K_sum_i];
$winner_y = $y_arr [$max_K_sum_i];
$winner_z = $z_arr [$max_K_sum_i];
$winner_lat = $lat_arr[$max_K_sum_i];
$winner_lng = $lng_arr[$max_K_sum_i];
$sin_winner_lat = sin($winner_lat * $rad);
$cos_winner_lat = cos($winner_lat * $rad);
$sin_winner_lng = sin($winner_lng * $rad);
$cos_winner_lng = cos($winner_lng * $rad);
$east_x = -$local_grid_step * $sin_winner_lng;
$east_y = $local_grid_step * $cos_winner_lng;
$east_z = 0;
$north_x = -$local_grid_step * $sin_winner_lat * $cos_winner_lng;
$north_y = -$local_grid_step * $sin_winner_lat * $sin_winner_lng;
$north_z = $local_grid_step * $cos_winner_lat;
if ($local_grid_radius > 0 && $local_grid_step > 0) {
$r = intval($local_grid_radius / $local_grid_step);
$r2 = $r * $r;
$h = $local_kernel_radius;
$h2 = $h * $h;
$max_L_sum = -1;
$max_L_sum_idx = -1;
for ($i = -$r; $i <= $r; $i++) {
$winner_east_x = $winner_x + $i * $east_x;
$winner_east_y = $winner_y + $i * $east_y;
$winner_east_z = $winner_z + $i * $east_z;
$j_max = intval(sqrt($r2 - $i * $i));
for ($j = -$j_max; $j <= $j_max; $j++) {
$x = $winner_east_x + $j * $north_x;
$y = $winner_east_y + $j * $north_y;
$z = $winner_east_z + $j * $north_z;
$L_sum = 0;
for ($k = 0; $k < $lat_lng_count; $k++) {
$dx = $x - $x_arr[$k];
$dy = $y - $y_arr[$k];
$dz = $z - $z_arr[$k];
$d2 = $dx * $dx + $dy * $dy + $dz * $dz;
if ($d2 < $h2) {
$L_sum += $w_arr[$k] * ($h2 - $d2); // Epanechnikov
}
}
if ($max_L_sum < $L_sum) {
$max_L_sum = $L_sum;
$max_L_sum_i = $i;
$max_L_sum_j = $j;
}
}
}
$x = $winner_x + $max_L_sum_i * $east_x + $max_L_sum_j * $north_x;
$y = $winner_y + $max_L_sum_i * $east_y + $max_L_sum_j * $north_y;
$z = $winner_z + $max_L_sum_i * $east_z + $max_L_sum_j * $north_z;
} else if ($local_grid_radius > 0) {
$r = $local_grid_radius;
$r2 = $r * $r;
$wx_sum = 0;
$wy_sum = 0;
$wz_sum = 0;
$w_sum = 0;
for ($k = 0; $k < $lat_lng_count; $k++) {
$xk = $x_arr[$k];
$yk = $y_arr[$k];
$zk = $z_arr[$k];
$dx = $winner_x - $xk;
$dy = $winner_y - $yk;
$dz = $winner_z - $zk;
$d2 = $dx * $dx + $dy * $dy + $dz * $dz;
if ($d2 <= $r2) {
$wk = $w_arr[$k];
$wx_sum += $wk * $xk;
$wy_sum += $wk * $yk;
$wz_sum += $wk * $zk;
$w_sum += $wk;
}
}
$x = $wx_sum / $w_sum;
$y = $wy_sum / $w_sum;
$z = $wz_sum / $w_sum;
$max_L_sum_i = false;
$max_L_sum_j = false;
} else {
return array($winner_lat, $winner_lng, $max_K_sum_i, false, false);
}
$deg = 180 / M_PI;
$a2 = $a * $a;
$e4 = $e2 * $e2;
$p = sqrt($x * $x + $y * $y);
$zeta = (1 - $e2) * $z * $z / $a2;
$rho = ($p * $p / $a2 + $zeta - $e4) / 6;
$rho3 = $rho * $rho * $rho;
$s = $e4 * $zeta * $p * $p / (4 * $a2);
$t = pow($s + $rho3 + sqrt($s * ($s + 2 * $rho3)), 1 / 3);
$u = $rho + $t + $rho * $rho / $t;
$v = sqrt($u * $u + $e4 * $zeta);
$w = $e2 * ($u + $v - $zeta) / (2 * $v);
$k = 1 + $e2 * (sqrt($u + $v + $w * $w) + $w) / ($u + $v);
$lat = atan($k * $z / $p) * $deg;
$lng = atan2($y, $x) * $deg;
return array($lat, $lng, $max_K_sum_i, $max_L_sum_i, $max_L_sum_j);
}
The fact that distances are Euclidean and not great-circle should have negligible effects for the task at hand. Calculating great-circle distances would be much more cumbersome, and would cause only the weight of very far points to be significantly lower - but these points already have a very low weight. In principle, the same effect could be achieved by a different kernel. Kernels that have a complete cut-off beyond some distance, like the Epanechnikov kernel, don't have this problem at all (in practice).
The conversion between lat,lng and x,y,z for the WGS84 datum is given exactly (although without guarantee of numerical stability) more as a reference than because of a true need. If the height is to be taken into account, or if a faster back-conversion is needed, please refer to the Wikipedia article.
The Epanechnikov kernel, besides being "more local" than the Gaussian and Trossian kernels, has the advantage of being the fastest for the second loop, which is O(ng), where g is the number of points of the local grid, and can also be employed in the first loop, which is O(n²), if n is big.

This can be solved by finding a jeopardy surface. See Rossmo's Formula.
This is the predator problem. Given a set of geographically-located carcasses, where is the lair of the predator? Rossmo's formula solves this problem.

Find the point with the largest density estimate.
Should be pretty much straightforward. Use a kernel radius that roughly covers a large airport in diameter. A 2D Gaussian or Epanechnikov kernel should be fine.
http://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation
This is similar to computing a Heap Map: http://en.wikipedia.org/wiki/Heat_map
and then finding the brightest spot there. Except it computes the brightness right away.
For fun I read a 1% sample of the Geocoordinates of DBpedia (i.e. Wikipedia) into ELKI, projected it into 3D space and enabled the density estimation overlay (hidden in the visualizers scatterplot menu). You can see there is a hotspot on Europe, and to a lesser extend in the US. The hotspot in Europe is Poland I believe. Last I checked, someone apparently had created a Wikipedia article with Geocoordinates for pretty much any town in Poland. The ELKI visualizer, unfortunately, neither allows you to zoom in, rotate, or reduce the kernel bandwidth to visually find the most dense point. But it's straightforward to implement yourself; you probably also don't need to go into 3D space, but can just use latitudes and longitudes.
Kernel Density Estimation should be available in tons of applications. The one in R is probably much more powerful. I just recently discovered this heatmap in ELKI, so I knew how to quickly access it. See e.g. http://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html for a related R function.
On your data, in R, try for example:
library(kernSmooth)
smoothScatter(data, nbin=512, bandwidth=c(.25,.25))
this should show a strong preference for Chicago.
library(kernSmooth)
dens=bkde2D(data, gridsize=c(512, 512), bandwidth=c(.25,.25))
contour(dens$x1, dens$x2, dens$fhat)
maxpos = which(dens$fhat == max(dens$fhat), arr.ind=TRUE)
c(dens$x1[maxpos[1]], dens$x2[maxpos[2]])
yields [1] 42.14697 -88.09508, which is less than 10 miles from Chicago airport.
To get better coordinates try:
rerunning on a 20x20 miles area around the estimated coordinates
a non-binned KDE in that area
better bandwidth selection with dpik
higher grid resolution

in Astrophysics we use the so called "half mass radius". Given a distribution and its center, the half mass radius is the minimum radius of a circle that contains half of the points of your distribution.
This quantity is a characteristic length of a distribution of points.
If you want that the home of the helicopter is where the points are maximally concentrated so it is the point that has the minimum half mass radius!
My algorithm is as follows: for each point you compute this half mass radius centring the distribution in the current point. The "home" of the helicopter will be the point with the minimum half mass radius.
I've implemented it and the computed center is 42.149994 -88.133698 (which is in Chicago)
I've also used the 0.2 of the total mass instead of the 0.5(half) usually used in Astrophysics.
This is my (in python) alghorithm that finds the home of the helicopter:
import math
import numpy
def inside(points,center,radius):
ids=(((points[:,0]-center[0])**2.+(points[:,1]-center[1])**2.)<=radius**2.)
return points[ids]
points = numpy.loadtxt(open('points.txt'),comments='#')
npoints=len(points)
deltar=0.1
idcenter=None
halfrmin=None
for i in xrange(0,npoints):
center=points[i]
radius=0.
stayHere=True
while stayHere:
radius=radius+deltar
ninside=len(inside(points,center,radius))
#print 'point',i,'r',radius,'in',ninside,'center',center
if(ninside>=npoints*0.2):
if(halfrmin==None or radius<halfrmin):
halfrmin=radius
idcenter=i
print 'point',i,halfrmin,idcenter,points[idcenter]
stayHere=False
#print halfrmin,idcenter
print points[idcenter]

You can use DBSCAN for that task.
DBSCAN is a density based clustering with a notion of noise. You need two parameters:
First the number of points a cluster should have at minimum "minpoints".
And second a neighbourhood parameter called "epsilon" that sets a distance threshold to the surrounding points that should be included in your cluster.
The whole algorithm works like this:
Start with an arbitrary point in your set that hasn't been visited yet
Retrieve points from the epsilon neighbourhood mark all as visited
if you have found enough points in this neighbourhood (> minpoints parameter) you start a new cluster and assign those points. Now recurse into step 2 again for every point in this cluster.
if you don't have, declare this point as noise
go all over again until you've visited all points
It is really simple to implement and there are lots of frameworks that support this algorithm already. To find the mean of your cluster, you can simply take the mean of all the assigned points from its neighbourhood.
However, unlike the method that #TylerDurden proposes, this needs a parameterization- so you need to find some hand tuned parameters that fit your problem.
In your case, you can try to set the minpoints to 10% of your total points if the plane is likely to stay 10% of the time you track at an airport. The density parameter epsilon depends on the resolution of your geographic sensor and the distance metric you use- I would suggest the haversine distance for geographic data.

How about divide the map into many zones and then find the center of plane in zone with the most plane. Algorithm will be something like this
set Zones[40]
foreach Plane in Planes
Zones[GetZone(Plane.position)].Add(Plane)
set MaxZone = Zones[0]
foreach Zone in Zones
if MaxZone.Length() < Zone.Length()
MaxZone = Zone
set Center
foreach Plane in MaxZone
Center.X += Plane.X
Center.Y += Plane.Y
Center.X /= MaxZone.Length
Center.Y /= MaxZone.Length

All I have on this machine is an old compiler so I made an ASCII version of this. It "draws" (in ASCII) a map - dots are points, X is where the real source is, G is where the guessed source is. If the two overlap, only X is shown.
Examples (DIFFICULTY 1.5 and 3 respectively):
The points are generated by picking a random point as the source, then randomly distributing points, making them more likely to be closer to the source.
DIFFICULTY is a floating point constant that regulates the initial point generation - how much more likely the points are to be closer to the source - if it is 1 or less, the program should be able to guess the exact source, or very close. At 2.5, it should still be pretty decent. At 4+, it will start to guess worse, but I think it still guesses better than a human would.
It could be optimized by using binary search over X, then Y - this would make the guess worse, but would be much, much faster. Or by starting with larger blocks, then splitting the best block further (or the best block and the 8 surrounding it). For a higher resolution system, one of these would be necessary. This is quite a naive approach, though, but it seems to work well in an 80x24 system. :D
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#define Y 24
#define X 80
#define DIFFICULTY 1 // Try different values...
static int point[Y][X];
double dist(int x1, int y1, int x2, int y2)
{
return sqrt((y1 - y2)*(y1 - y2) + (x1 - x2)*(x1 - x2));
}
main()
{
srand(time(0));
int y = rand()%Y;
int x = rand()%X;
// Generate points
for (int i = 0; i < Y; i++)
{
for (int j = 0; j < X; j++)
{
double u = DIFFICULTY * pow(dist(x, y, j, i), 1.0 / DIFFICULTY);
if ((int)u == 0)
u = 1;
point[i][j] = !(rand()%(int)u);
}
}
// Find best source
int maxX = -1;
int maxY = -1;
double maxScore = -1;
for (int cy = 0; cy < Y; cy++)
{
for (int cx = 0; cx < X; cx++)
{
double score = 0;
for (int i = 0; i < Y; i++)
{
for (int j = 0; j < X; j++)
{
if (point[i][j] == 1)
{
double d = dist(cx, cy, j, i);
if (d == 0)
d = 0.5;
score += 1000 / d;
}
}
}
if (score > maxScore || maxScore == -1)
{
maxScore = score;
maxX = cx;
maxY = cy;
}
}
}
// Print out results
for (int i = 0; i < Y; i++)
{
for (int j = 0; j < X; j++)
{
if (i == y && j == x)
printf("X");
else if (i == maxY && j == maxX)
printf("G");
else if (point[i][j] == 0)
printf(" ");
else if (point[i][j] == 1)
printf(".");
}
}
printf("Distance from real source: %f", dist(maxX, maxY, x, y));
scanf("%d", 0);
}

Virtual earth has a very good explanation of how you can do it relatively quick. They also have provided code examples. Please have a look at http://soulsolutions.com.au/Articles/ClusteringVirtualEarthPart1.aspx

A simple mixture model seems to work pretty well for this problem.
In general, to get a point that minimizes the distance to all other points in a dataset, you can just take the mean. In this case, you want to find a point that minimizes the distance from a subset of concentrated points. If you postulate that a point can either come from the concentrated set of points of interest or from a diffuse set of background points, then this gives a mixture model.
I have included some python code below. The concentrated area is modeled by a high-precision normal distribution and the background point are modeled by either a low-precision normal distribution or a uniform distribution over a bounding box on the dataset (there is a line of code that can be commented out to switch between these options). Also, mixture models can be somewhat unstable, so running the EM algorithm a few times with random initial conditions and choosing the run with the highest log-likelihood gives better results.
If you are actually looking at airplanes, then adding some sort of time dependent dynamics will probably improve your ability to infer the home base immensely.
I would also be wary of Rossimo's formula because it includes some pretty-strong assumptions about crime distributions.
#the dataset
sdata='''41.892694,-87.670898
42.056048,-88.000488
41.941744,-88.000488
42.072361,-88.209229
42.091933,-87.982635
42.149994,-88.133698
42.171371,-88.286133
42.23241,-88.305359
42.196811,-88.099365
42.189689,-88.188629
42.17646,-88.173523
42.180531,-88.209229
42.18168,-88.187943
42.185496,-88.166656
42.170485,-88.150864
42.150634,-88.140564
42.156743,-88.123741
42.118555,-88.105545
42.121356,-88.112755
42.115499,-88.102112
42.119319,-88.112411
42.118046,-88.110695
42.117791,-88.109322
42.182189,-88.182449
42.194145,-88.183823
42.189057,-88.196182
42.186513,-88.200645
42.180917,-88.197899
42.178881,-88.192062
41.881656,-87.6297
41.875521,-87.6297
41.87872,-87.636566
41.872073,-87.62661
41.868239,-87.634506
41.86875,-87.624893
41.883065,-87.62352
41.881021,-87.619743
41.879998,-87.620087
41.8915,-87.633476
41.875163,-87.620773
41.879125,-87.62558
41.862763,-87.608757
41.858672,-87.607899
41.865192,-87.615795
41.87005,-87.62043
42.073061,-87.973022
42.317241,-88.187256
42.272546,-88.088379
42.244086,-87.890625
42.044512,-88.28064
39.754977,-86.154785
39.754977,-89.648437
41.043369,-85.12207
43.050074,-89.406738
43.082179,-87.912598
42.7281,-84.572754
39.974226,-83.056641
38.888093,-77.01416
39.923692,-75.168457
40.794318,-73.959961
40.877439,-73.146973
40.611086,-73.740234
40.627764,-73.234863
41.784881,-71.367187
42.371988,-70.993652
35.224587,-80.793457
36.753465,-76.069336
39.263361,-76.530762
25.737127,-80.222168
26.644083,-81.958008
30.50223,-87.275391
29.436309,-98.525391
30.217839,-97.844238
29.742023,-95.361328
31.500409,-97.163086
32.691688,-96.877441
32.691688,-97.404785
35.095754,-106.655273
33.425138,-112.104492
32.873244,-117.114258
33.973545,-118.256836
33.681497,-117.905273
33.622982,-117.734985
33.741828,-118.092041
33.64585,-117.861328
33.700707,-118.015137
33.801189,-118.251343
33.513132,-117.740479
32.777244,-117.235107
32.707939,-117.158203
32.703317,-117.268066
32.610821,-117.075806
34.419726,-119.701538
37.750358,-122.431641
37.50673,-122.387695
37.174817,-121.904297
37.157307,-122.321777
37.271492,-122.033386
37.435238,-122.217407
37.687794,-122.415161
37.542025,-122.299805
37.609506,-122.398682
37.544203,-122.0224
37.422151,-122.13501
37.395971,-122.080078
45.485651,-122.739258
47.719463,-122.255859
47.303913,-122.607422
45.176713,-122.167969
39.566,-104.985352
39.124201,-94.614258
35.454518,-97.426758
38.473482,-90.175781
45.021612,-93.251953
42.417881,-83.056641
41.371141,-81.782227
33.791132,-84.331055
30.252543,-90.439453
37.421248,-122.174835
37.47794,-122.181702
37.510628,-122.254486
37.56943,-122.346497
37.593373,-122.384949
37.620571,-122.489319
36.984249,-122.03064
36.553017,-121.893311
36.654442,-121.772461
36.482381,-121.876831
36.15042,-121.651611
36.274518,-121.838379
37.817717,-119.569702
39.31657,-120.140991
38.933041,-119.992676
39.13785,-119.778442
39.108019,-120.239868
38.586082,-121.503296
38.723354,-121.289062
37.878444,-119.437866
37.782994,-119.470825
37.973771,-119.685059
39.001377,-120.17395
40.709076,-73.948975
40.846346,-73.861084
40.780452,-73.959961
40.778829,-73.958931
40.78372,-73.966012
40.783688,-73.965325
40.783692,-73.965615
40.783675,-73.965741
40.783835,-73.965873
'''
import StringIO
import numpy as np
import re
import matplotlib.pyplot as plt
def lp(l):
return map(lambda m: float(m.group()),re.finditer('[^, \n]+',l))
data=np.array(map(lp,StringIO.StringIO(sdata)))
xmn=np.min(data[:,0])
xmx=np.max(data[:,0])
ymn=np.min(data[:,1])
ymx=np.max(data[:,1])
# area of the point set bounding box
area=(xmx-xmn)*(ymx-ymn)
M_ITER=100 #maximum number of iterations
THRESH=1e-10 # stopping threshold
def em(x):
print '\nSTART EM'
mlst=[]
mu0=np.mean( data , 0 ) # the sample mean of the data - use this as the mean of the low-precision gaussian
# the mean of the high-precision Gaussian - this is what we are looking for
mu=np.random.rand( 2 )*np.array([xmx-xmn,ymx-ymn])+np.array([xmn,ymn])
lam_lo=.001 # precision of the low-precision Gaussian
lam_hi=.1 # precision of the high-precision Gaussian
prz=np.random.rand( 1 ) # probability of choosing the high-precision Gaussian mixture component
for i in xrange(M_ITER):
mlst.append(mu[:])
l_hi=np.log(prz)+np.log(lam_hi)-.5*lam_hi*np.sum((x-mu)**2,1)
#low-precision normal background distribution
l_lo=np.log(1.0-prz)+np.log(lam_lo)-.5*lam_lo*np.sum((x-mu0)**2,1)
#uncomment for the uniform background distribution
#l_lo=np.log(1.0-prz)-np.log(area)
#expectation step
zs=1.0/(1.0+np.exp(l_lo-l_hi))
#compute bound on the likelihood
lh=np.sum(zs*l_hi+(1.0-zs)*l_lo)
print i,lh
#maximization step
mu=np.sum(zs[:,None]*x,0)/np.sum(zs) #mean
lam_hi=np.sum(zs)/np.sum(zs*.5*np.sum((x-mu)**2,1)) #precision
prz=1.0/(1.0+np.sum(1.0-zs)/np.sum(zs)) #mixure component probability
try:
if np.abs((lh-old_lh)/lh)<THRESH:
break
except:
pass
old_lh=lh
mlst.append(mu[:])
return lh,lam_hi,mlst
if __name__=='__main__':
#repeat the EM algorithm a number of times and get the run with the best log likelihood
mx_prm=em(data)
for i in xrange(4):
prm=em(data)
if prm[0]>mx_prm[0]:
mx_prm=prm
print prm[0]
print mx_prm[0]
lh,lam_hi,mlst=mx_prm
mu=mlst[-1]
print 'best loglikelihood:', lh
#print 'final precision value:', lam_hi
print 'point of interest:', mu
plt.plot(data[:,0],data[:,1],'.b')
for m in mlst:
plt.plot(m[0],m[1],'xr')
plt.show()

You can easily adapt the Rossmo's formula, quoted by Tyler Durden to your case with few simple notes:
The formula :
This formula give something close to a probability of presence of the base operation for a predator or a serial killer. In your case it could give the probability of a base to be in a certain point. I'll explain later how to use it. U can write it this way :
Proba(base on point A)= Sum{on all spots} ( Phi/(dist^f)+(1-Phi)(B*(g-f))/(2B-dist)^g )
Using Euclidian distance
You want an Euclidian distance and not the Manhattan's one because an airplane or helicopter is not bound to road/streets. So using Euclidian distance is the correct way, if you are tracking an airplane & not a serial killer. So "dist" in the formula is the euclidian distance between the spot ur testing and the spot considered
Taking reasonable variable B
Variable B was used to represent the rule "reasonably smart killer will not kill his neighbor". In your case the will also applied because no one use an airplane/roflcopter to get to the next street corner. we can suppose that the minimal journey is for example 10km or anything reasonable when applied to your case.
Exponential factor f
Factor f is used to add a weight to the distance. For example if all the spots are in a small area you could want a big factor f because the probability of the airport/base/HQ will decrease fast if all your datapoint are in the same sector. g works in a similar way, allow to choose the size of "base is unlikely to be just next to the spot" area
Factor Phi :
Again this factor has to be determined using your knowledge of the problem. It permits to choose the most accurate factor between "base is close to spots" and "i'll not use the plane to make 5 m" if for example u think that the second one is almost irrelevent you can set Phi to 0.95 (0<Phi<1) If both are interesting phi will be around 0.5
How to implement it as something usefull :
First you want to divide your map into little squares : meshing the map ( just like invisal did) (the smaller the squares ,the more accurate the result (in general)) then using the formula to find the more probable location. In fact the mesh is just an array with all possible locations. (if u want to be accurate you increase the number of possible spots but it will require more computational time and PhP is not well-known for it's amazing speed)
Algorithm :
//define all the factors you need(B , f , g , phi)
for(i=0..mesh_size) // computing the probability of presence for each square of the mesh
{
P(i)=0;
geocode squarePosition;//GeoCode of the square's center
for(j=0..geocodearray_size)//sum on all the known spots
{
dist=Distance(geocodearray[j],squarePosition);//small function returning distance between two geocodes
P(i)+=(Phi/pow(dist,f))+(1-Phi)*pow(B,g-f)/pow(2B-dist,g);
}
}
return geocode corresponding to max(P(i))
Hope that it will help you

First I would like to express my fondness of your method in illustrating and explaining the problem ..
If I were in your shoes, I would go for a density based algorithm like DBSCAN
and then after clustering the areas and removing the noise points a few areas (choices) will remain .. then I'll take the cluster with the highest density of points and calculate the average point and find the nearest real point to it . done, found the place! :).
Regards,

Why not something like this:
For each point, calculate it's distance from all other points and sum the total.
The point with the smallest sum is your center.
Maybe sum isn't the best metric to use. Possibly the point with the most "small distances"?

Sum over the distances. Take the point with the smallest summed distance.
function () {
for i in points P:
S[i] = 0
for j in points P:
S[i] += distance(P[i], P[j])
return min(S);
}

You can take a minimum spanning tree and remove the longest edges. The smaller trees give you the centeroid to lookup. The algorithm name is single-link k-clustering. There is a post here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering.

An interview question: About Probability

An interview question:
Given a function f(x) that 1/4 times returns 0, 3/4 times returns 1.
Write a function g(x) using f(x) that 1/2 times returns 0, 1/2 times returns 1.
My implementation is:
function g(x) = {
if (f(x) == 0){ // 1/4
var s = f(x)
if( s == 1) {// 3/4 * 1/4
return s // 3/16
} else {
g(x)
}
} else { // 3/4
var k = f(x)
if( k == 0) {// 1/4 * 3/4
return k // 3/16
} else {
g(x)
}
}
}
Am I right? What's your solution?(you can use any language)

If you call f(x) twice in a row, the following outcomes are possible (assuming that
successive calls to f(x) are independent, identically distributed trials):
00 (probability 1/4 * 1/4)
01 (probability 1/4 * 3/4)
10 (probability 3/4 * 1/4)
11 (probability 3/4 * 3/4)
01 and 10 occur with equal probability. So iterate until you get one of those
cases, then return 0 or 1 appropriately:
do
a=f(x); b=f(x);
while (a == b);
return a;
It might be tempting to call f(x) only once per iteration and keep track of the two
most recent values, but that won't work. Suppose the very first roll is 1,
with probability 3/4. You'd loop until the first 0, then return 1 (with probability 3/4).

The problem with your algorithm is that it repeats itself with high probability. My code:
function g(x) = {
var s = f(x) + f(x) + f(x);
// s = 0, probability: 1/64
// s = 1, probability: 9/64
// s = 2, probability: 27/64
// s = 3, probability: 27/64
if (s == 2) return 0;
if (s == 3) return 1;
return g(x); // probability to go into recursion = 10/64, with only 1 additional f(x) calculation
}
I've measured average number of times f(x) was calculated for your algorithm and for mine. For yours f(x) was calculated around 5.3 times per one g(x) calculation. With my algorithm this number reduced to around 3.5. The same is true for other answers so far since they are actually the same algorithm as you said.
P.S.: your definition doesn't mention 'random' at the moment, but probably it is assumed. See my other answer.

Your solution is correct, if somewhat inefficient and with more duplicated logic. Here is a Python implementation of the same algorithm in a cleaner form.
def g ():
while True:
a = f()
if a != f():
return a
If f() is expensive you'd want to get more sophisticated with using the match/mismatch information to try to return with fewer calls to it. Here is the most efficient possible solution.
def g ():
lower = 0.0
upper = 1.0
while True:
if 0.5 < lower:
return 1
elif upper < 0.5:
return 0
else:
middle = 0.25 * lower + 0.75 * upper
if 0 == f():
lower = middle
else:
upper = middle
This takes about 2.6 calls to g() on average.
The way that it works is this. We're trying to pick a random number from 0 to 1, but we happen to stop as soon as we know whether the number is 0 or 1. We start knowing that the number is in the interval (0, 1). 3/4 of the numbers are in the bottom 3/4 of the interval, and 1/4 are in the top 1/4 of the interval. We decide which based on a call to f(x). This means that we are now in a smaller interval.
If we wash, rinse, and repeat enough times we can determine our finite number as precisely as possible, and will have an absolutely equal probability of winding up in any region of the original interval. In particular we have an even probability of winding up bigger than or less than 0.5.
If you wanted you could repeat the idea to generate an endless stream of bits one by one. This is, in fact, provably the most efficient way of generating such a stream, and is the source of the idea of entropy in information theory.

Given a function f(x) that 1/4 times returns 0, 3/4 times returns 1
Taking this statement literally, f(x) if called four times will always return zero once and 1 3 times. This is different than saying f(x) is a probabalistic function and the 0 to 1 ratio will approach 1 to 3 (1/4 vs 3/4) over many iterations. If the first interpretation is valid, than the only valid function for f(x) that will meet the criteria regardless of where in the sequence you start from is the sequence 0111 repeating. (or 1011 or 1101 or 1110 which are the same sequence from a different starting point). Given that constraint,
g()= (f() == f())
should suffice.

As already mentioned your definition is not that good regarding probability. Usually it means that not only probability is good but distribution also. Otherwise you can simply write g(x) which will return 1,0,1,0,1,0,1,0 - it will return them 50/50, but numbers won't be random.
Another cheating approach might be:
var invert = false;
function g(x) {
invert = !invert;
if (invert) return 1-f(x);
return f(x);
}
This solution will be better than all others since it calls f(x) only one time. But the results will not be very random.

A refinement of the same approach used in btilly's answer, achieving an average ~1.85 calls to f() per g() result (further refinement documented below achieves ~1.75, tbilly's ~2.6, Jim Lewis's accepted answer ~5.33). Code appears lower in the answer.
Basically, I generate random integers in the range 0 to 3 with even probability: the caller can then test bit 0 for the first 50/50 value, and bit 1 for a second. Reason: the f() probabilities of 1/4 and 3/4 map onto quarters much more cleanly than halves.
Description of algorithm
btilly explained the algorithm, but I'll do so in my own way too...
The algorithm basically generates a random real number x between 0 and 1, then returns a result depending on which "result bucket" that number falls in:
result bucket result
x < 0.25 0
0.25 <= x < 0.5 1
0.5 <= x < 0.75 2
0.75 <= x 3
But, generating a random real number given only f() is difficult. We have to start with the knowledge that our x value should be in the range 0..1 - which we'll call our initial "possible x" space. We then hone in on an actual value for x:
each time we call f():
if f() returns 0 (probability 1 in 4), we consider x to be in the lower quarter of the "possible x" space, and eliminate the upper three quarters from that space
if f() returns 1 (probability 3 in 4), we consider x to be in the upper three-quarters of the "possible x" space, and eliminate the lower quarter from that space
when the "possible x" space is completely contained by a single result bucket, that means we've narrowed x down to the point where we know which result value it should map to and have no need to get a more specific value for x.
It may or may not help to consider this diagram :-):
"result bucket" cut-offs 0,.25,.5,.75,1
0=========0.25=========0.5==========0.75=========1 "possible x" 0..1
| | . . | f() chooses x < vs >= 0.25
| result 0 |------0.4375-------------+----------| "possible x" .25..1
| | result 1| . . | f() chooses x < vs >= 0.4375
| | | . ~0.58 . | "possible x" .4375..1
| | | . | . | f() chooses < vs >= ~.58
| | ||. | | . | 4 distinct "possible x" ranges
Code
int g() // return 0, 1, 2, or 3
{
if (f() == 0) return 0;
if (f() == 0) return 1;
double low = 0.25 + 0.25 * (1.0 - 0.25);
double high = 1.0;
while (true)
{
double cutoff = low + 0.25 * (high - low);
if (f() == 0)
high = cutoff;
else
low = cutoff;
if (high < 0.50) return 1;
if (low >= 0.75) return 3;
if (low >= 0.50 && high < 0.75) return 2;
}
}
If helpful, an intermediary to feed out 50/50 results one at a time:
int h()
{
static int i;
if (!i)
{
int x = g();
i = x | 4;
return x & 1;
}
else
{
int x = i & 2;
i = 0;
return x ? 1 : 0;
}
}
NOTE: This can be further tweaked by having the algorithm switch from considering an f()==0 result to hone in on the lower quarter, to having it hone in on the upper quarter instead, based on which on average resolves to a result bucket more quickly. Superficially, this seemed useful on the third call to f() when an upper-quarter result would indicate an immediate result of 3, while a lower-quarter result still spans probability point 0.5 and hence results 1 and 2. When I tried it, the results were actually worse. A more complex tuning was needed to see actual benefits, and I ended up writing a brute-force comparison of lower vs upper cutoff for second through eleventh calls to g(). The best result I found was an average of ~1.75, resulting from the 1st, 2nd, 5th and 8th calls to g() seeking low (i.e. setting low = cutoff).

Here is a solution based on central limit theorem, originally due to a friend of mine:
/*
Given a function f(x) that 1/4 times returns 0, 3/4 times returns 1. Write a function g(x) using f(x) that 1/2 times returns 0, 1/2 times returns 1.
*/
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cstdio>
using namespace std;
int f() {
if (rand() % 4 == 0) return 0;
return 1;
}
int main() {
srand(time(0));
int cc = 0;
for (int k = 0; k < 1000; k++) { //number of different runs
int c = 0;
int limit = 10000; //the bigger the limit, the more we will approach %50 percent
for (int i=0; i<limit; ++i) c+= f();
cc += c < limit*0.75 ? 0 : 1; // c will be 0, with probability %50
}
printf("%d\n",cc); //cc is gonna be around 500
return 0;
}

Since each return of f() represents a 3/4 chance of TRUE, with some algebra we can just properly balance the odds. What we want is another function x() which returns a balancing probability of TRUE, so that
function g() {
return f() && x();
}
returns true 50% of the time.
So let's find the probability of x (p(x)), given p(f) and our desired total probability (1/2):
p(f) * p(x) = 1/2
3/4 * p(x) = 1/2
p(x) = (1/2) / 3/4
p(x) = 2/3
So x() should return TRUE with a probability of 2/3, since 2/3 * 3/4 = 6/12 = 1/2;
Thus the following should work for g():
function g() {
return f() && (rand() < 2/3);
}

Assuming
P(f[x] == 0) = 1/4
P(f[x] == 1) = 3/4
and requiring a function g[x] with the following assumptions
P(g[x] == 0) = 1/2
P(g[x] == 1) = 1/2
I believe the following definition of g[x] is sufficient (Mathematica)
g[x_] := If[f[x] + f[x + 1] == 1, 1, 0]
or, alternatively in C
int g(int x)
{
return f(x) + f(x+1) == 1
? 1
: 0;
}
This is based on the idea that invocations of {f[x], f[x+1]} would produce the following outcomes
{
{0, 0},
{0, 1},
{1, 0},
{1, 1}
}
Summing each of the outcomes we have
{
0,
1,
1,
2
}
where a sum of 1 represents 1/2 of the possible sum outcomes, with any other sum making up the other 1/2.
Edit.
As bdk says - {0,0} is less likely than {1,1} because
1/4 * 1/4 < 3/4 * 3/4
However, I am confused myself because given the following definition for f[x] (Mathematica)
f[x_] := Mod[x, 4] > 0 /. {False -> 0, True -> 1}
or alternatively in C
int f(int x)
{
return (x % 4) > 0
? 1
: 0;
}
then the results obtained from executing f[x] and g[x] seem to have the expected distribution.
Table[f[x], {x, 0, 20}]
{0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0}
Table[g[x], {x, 0, 20}]
{1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1}

This is much like the Monty Hall paradox.
In general.
Public Class Form1
'the general case
'
'twiceThis = 2 is 1 in four chance of 0
'twiceThis = 3 is 1 in six chance of 0
'
'twiceThis = x is 1 in 2x chance of 0
Const twiceThis As Integer = 7
Const numOf As Integer = twiceThis * 2
Private Sub Button1_Click(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles Button1.Click
Const tries As Integer = 1000
y = New List(Of Integer)
Dim ct0 As Integer = 0
Dim ct1 As Integer = 0
Debug.WriteLine("")
''show all possible values of fx
'For x As Integer = 1 To numOf
' Debug.WriteLine(fx)
'Next
'test that gx returns 50% 0's and 50% 1's
Dim stpw As New Stopwatch
stpw.Start()
For x As Integer = 1 To tries
Dim g_x As Integer = gx()
'Debug.WriteLine(g_x.ToString) 'used to verify that gx returns 0 or 1 randomly
If g_x = 0 Then ct0 += 1 Else ct1 += 1
Next
stpw.Stop()
'the results
Debug.WriteLine((ct0 / tries).ToString("p1"))
Debug.WriteLine((ct1 / tries).ToString("p1"))
Debug.WriteLine((stpw.ElapsedTicks / tries).ToString("n0"))
End Sub
Dim prng As New Random
Dim y As New List(Of Integer)
Private Function fx() As Integer
'1 in numOf chance of zero being returned
If y.Count = 0 Then
'reload y
y.Add(0) 'fx has only one zero value
Do
y.Add(1) 'the rest are ones
Loop While y.Count < numOf
End If
'return a random value
Dim idx As Integer = prng.Next(y.Count)
Dim rv As Integer = y(idx)
y.RemoveAt(idx) 'remove the value selected
Return rv
End Function
Private Function gx() As Integer
'a function g(x) using f(x) that 50% of the time returns 0
' that 50% of the time returns 1
Dim rv As Integer = 0
For x As Integer = 1 To twiceThis
fx()
Next
For x As Integer = 1 To twiceThis
rv += fx()
Next
If rv = twiceThis Then Return 1 Else Return 0
End Function
End Class

How do I generate a random string of up to a certain length?

I would like to generate a random string (or a series of random strings, repetitions allowed) of length between 1 and n characters from some (finite) alphabet. Each string should be equally likely (in other words, the strings should be uniformly distributed).
The uniformity requirement means that an algorithm like this doesn't work:
alphabet = "abcdefghijklmnopqrstuvwxyz"
len = rand(1, n)
s = ""
for(i = 0; i < len; ++i)
s = s + alphabet[rand(0, 25)]
(pseudo code, rand(a, b) returns a integer between a and b, inclusively, each integer equally likely)
This algorithm generates strings with uniformly distributed lengths, but the actual distribution should be weighted toward longer strings (there are 26 times as many strings with length 2 as there are with length 1, and so on.) How can I achieve this?

What you need to do is generate your length and then your string as two distinct steps. You will need to first chose the length using a weighted approach. You can calculate the number of strings of a given length l for an alphabet of k symbols as k^l. Sum those up and then you have the total number of strings of any length, your first step is to generate a random number between 1 and that value and then bin it accordingly. Modulo off by one errors you would break at 26, 26^2, 26^3, 26^4 and so on. The logarithm based on the number of symbols would be useful for this task.
Once you have you length then you can generate the string as you have above.

Okay, there are 26 possibilities for a 1-character string, 262 for a 2-character string, and so on up to 2626 possibilities for a 26-character string.
That means there are 26 times as many possibilities for an (N)-character string than there are for an (N-1)-character string. You can use that fact to select your length:
def getlen(maxlen):
sz = maxlen
while sz != 1:
if rnd(27) != 1:
return sz
sz--;
return 1
I use 27 in the above code since the total sample space for selecting strings from "ab" is the 26 1-character possibilities and the 262 2-character possibilities. In other words, the ratio is 1:26 so 1-character has a probability of 1/27 (rather than 1/26 as I first answered).
This solution isn't perfect since you're calling rnd multiple times and it would be better to call it once with an possible range of 26N+26N-1+261 and select the length based on where your returned number falls within there but it may be difficult to find a random number generator that'll work on numbers that large (10 characters gives you a possible range of 2610+...+261 which, unless I've done the math wrong, is 146,813,779,479,510).
If you can limit the maximum size so that your rnd function will work in the range, something like this should be workable:
def getlen(chars,maxlen):
assert maxlen >= 1
range = chars
sampspace = 0
for i in 1 .. maxlen:
sampspace = sampspace + range
range = range * chars
range = range / chars
val = rnd(sampspace)
sz = maxlen
while val < sampspace - range:
sampspace = sampspace - range
range = range / chars
sz = sz - 1
return sz
Once you have the length, I would then use your current algorithm to choose the actual characters to populate the string.
Explaining it further:
Let's say our alphabet only consists of "ab". The possible sets up to length 3 are [ab] (2), [ab][ab] (4) and [ab][ab][ab] (8). So there is a 8/14 chance of getting a length of 3, 4/14 of length 2 and 2/14 of length 1.
The 14 is the magic figure: it's the sum of all 2n for n = 1 to the maximum length. So, testing that pseudo-code above with chars = 2 and maxlen = 3:
assert maxlen >= 1 [okay]
range = chars [2]
sampspace = 0
for i in 1 .. 3:
i = 1:
sampspace = sampspace + range [0 + 2 = 2]
range = range * chars [2 * 2 = 4]
i = 2:
sampspace = sampspace + range [2 + 4 = 6]
range = range * chars [4 * 2 = 8]
i = 3:
sampspace = sampspace + range [6 + 8 = 14]
range = range * chars [8 * 2 = 16]
range = range / chars [16 / 2 = 8]
val = rnd(sampspace) [number from 0 to 13 inclusive]
sz = maxlen [3]
while val < sampspace - range: [see below]
sampspace = sampspace - range
range = range / chars
sz = sz - 1
return sz
So, from that code, the first iteration of the final loop will exit with sz = 3 if val is greater than or equal to sampspace - range [14 - 8 = 6]. In other words, for the values 6 through 13 inclusive, 8 of the 14 possibilities.
Otherwise, sampspace becomes sampspace - range [14 - 8 = 6] and range becomes range / chars [8 / 2 = 4].
Then the second iteration of the final loop will exit with sz = 2 if val is greater than or equal to sampspace - range [6 - 4 = 2]. In other words, for the values 2 through 5 inclusive, 4 of the 14 possibilities.
Otherwise, sampspace becomes sampspace - range [6 - 4 = 2] and range becomes range / chars [4 / 2 = 2].
Then the third iteration of the final loop will exit with sz = 1 if val is greater than or equal to sampspace - range [2 - 2 = 0]. In other words, for the values 0 through 1 inclusive, 2 of the 14 possibilities (this iteration will always exit since the value must be greater than or equal to zero.
In retrospect, that second solution is a bit of a nightmare. In my personal opinion, I'd go for the first solution for its simplicity and to avoid the possibility of rather large numbers.

Building on my comment posted as a reply to the OP:
I'd consider it an exercise in base
conversion. You're simply generating a
"random number" in "base 26", where
a=0 and z=25. For a random string of
length n, generate a number between 1
and 26^n. Convert from base 10 to base
26, using symbols from your chosen
alphabet.
Here's a PHP implementation. I won't guaranty that there isn't an off-by-one error or two in here, but any such error should be minor:
<?php
$n = 5;
var_dump(randstr($n));
function randstr($maxlen) {
$dict = 'abcdefghijklmnopqrstuvwxyz';
$rand = rand(0, pow(strlen($dict), $maxlen));
$str = base_convert($rand, 10, 26);
//base convert returns base 26 using 0-9 and 15 letters a-p(?)
//we must convert those to our own set of symbols
return strtr($str, '1234567890abcdefghijklmnopqrstuvwxyz', $dict);
}

Instead of picking a length with uniform distribution, weight it according to how many strings are a given length. If your alphabet is size m, there are mx strings of size x, and (1-mn+1)/(1-m) strings of length n or less. The probability of choosing a string of length x should be mx*(1-m)/(1-mn+1).
Edit:
Regarding overflow - using floating point instead of integers will expand the range, so for a 26-character alphabet and single-precision floats, direct weight calculation shouldn't overflow for n<26.
A more robust approach is to deal with it iteratively. This should also minimize the effects of underflow:
int randomLength() {
for(int i = n; i > 0; i--) {
double d = Math.random();
if(d > (m - 1) / (m - Math.pow(m, -i))) {
return i;
}
}
return 0;
}
To make this more efficient by calculating fewer random numbers, we can reuse them by splitting intervals in more than one place:
int randomLength() {
for(int i = n; i > 0; i -= 5) {
double d = Math.random();
double c = (m - 1) / (m - Math.pow(m, -i))
for(int j = 0; j < 5; j++) {
if(d > c) {
return i - j;
}
c /= m;
}
}
for(int i = n % 0; i > 0; i--) {
double d = Math.random();
if(d > (m - 1) / (m - Math.pow(m, -i))) {
return i;
}
}
return 0;
}

Edit: This answer isn't quite right. See the bottom for a disproof. I'll leave it up for now in the hope someone can come up with a variant that fixes it.
It's possible to do this without calculating the length separately - which, as others have pointed out, requires raising a number to a large power, and generally seems like a messy solution to me.
Proving that this is correct is a little tough, and I'm not sure I trust my expository powers to make it clear, but bear with me. For the purposes of the explanation, we're generating strings of length at most n from an alphabet a of |a| characters.
First, imagine you have a maximum length of n, and you've already decided you're generating a string of at least length n-1. It should be obvious that there are |a|+1 equally likely possibilities: we can generate any of the |a| characters from the alphabet, or we can choose to terminate with n-1 characters. To decide, we simply pick a random number x between 0 and |a| (inclusive); if x is |a|, we terminate at n-1 characters; otherwise, we append the xth character of a to the string. Here's a simple implementation of this procedure in Python:
def pick_character(alphabet):
x = random.randrange(len(alphabet) + 1)
if x == len(alphabet):
return ''
else:
return alphabet[x]
Now, we can apply this recursively. To generate the kth character of the string, we first attempt to generate the characters after k. If our recursive invocation returns anything, then we know the string should be at least length k, and we generate a character of our own from the alphabet and return it. If, however, the recursive invocation returns nothing, we know the string is no longer than k, and we use the above routine to select either the final character or no character. Here's an implementation of this in Python:
def uniform_random_string(alphabet, max_len):
if max_len == 1:
return pick_character(alphabet)
suffix = uniform_random_string(alphabet, max_len - 1)
if suffix:
# String contains characters after ours
return random.choice(alphabet) + suffix
else:
# String contains no characters after our own
return pick_character(alphabet)
If you doubt the uniformity of this function, you can attempt to disprove it: suggest a string for which there are two distinct ways to generate it, or none. If there are no such strings - and alas, I do not have a robust proof of this fact, though I'm fairly certain it's true - and given that the individual selections are uniform, then the result must also select any string with uniform probability.
As promised, and unlike every other solution posted thus far, no raising of numbers to large powers is required; no arbitrary length integers or floating point numbers are needed to store the result, and the validity, at least to my eyes, is fairly easy to demonstrate. It's also shorter than any fully-specified solution thus far. ;)
If anyone wants to chip in with a robust proof of the function's uniformity, I'd be extremely grateful.
Edit: Disproof, provided by a friend:
dato: so imagine alphabet = 'abc' and n = 2
dato: you have 9 strings of length 2, 3 of length 1, 1 of length 0
dato: that's 13 in total
dato: so probability of getting a length 2 string should be 9/13
dato: and probability of getting a length 1 or a length 0 should be 4/13
dato: now if you call uniform_random_string('abc', 2)
dato: that transforms itself into a call to uniform_random_string('abc', 1)
dato: which is an uniform distribution over ['a', 'b', 'c', '']
dato: the first three of those yield all the 2 length strings
dato: and the latter produce all the 1 length strings and the empty strings
dato: but 0.75 > 9/13
dato: and 0.25 < 4/13

// Note space as an available char
alphabet = "abcdefghijklmnopqrstuvwxyz "
result_string = ""
for( ;; )
{
s = ""
for( i = 0; i < n; i++ )
s += alphabet[rand(0, 26)]
first_space = n;
for( i = 0; i < n; i++ )
if( s[ i ] == ' ' )
{
first_space = i;
break;
}
ok = true;
// Reject "duplicate" shorter strings
for( i = first_space + 1; i < n; i++ )
if( s[ i ] != ' ' )
{
ok = false;
break;
}
if( !ok )
continue;
// Extract the short version of the string
for( i = 0; i < first_space; i++ )
result_string += s[ i ];
break;
}
Edit: I forgot to disallow 0-length strings, that will take a bit more code which I don't have time to add now.
Edit: After considering how my answer doesn't scale to large n (takes too long to get lucky and find an accepted string), I like paxdiablo's answer much better. Less code too.

Personally I'd do it like this:
Let's say your alphabet has Z characters. Then the number of possible strings for each length L is:
L | Z
--------------------------
1 | 26
2 | 676 (= 26 * 26)
3 | 17576 (= 26 * 26 * 26)
...and so on.
Now let's say your maximum desired length is N. Then the total number of possible strings from length 1 to N that your function could generate would be the sum of a geometric sequence:
(1 - (Z ^ (N + 1))) / (1 - Z)
Let's call this value S. Then the probability of generating a string of any length L should be:
(Z ^ L) / S
OK, fine. This is all well and good; but how do we generate a random number given a non-uniform probability distribution?
The short answer is: you don't. Get a library to do that for you. I develop mainly in .NET, so one I might turn to would be Math.NET.
That said, it's really not so hard to come up with a rudimentary approach to doing this on your own.
Here's one way: take a generator that gives you a random value within a known uniform distribution, and assign ranges within that distribution of sizes dependent on your desired distribution. Then interpret the random value provided by the generator by determining which range it falls into.
Here's an example in C# of one way you could implement this idea (scroll to the bottom for example output):
RandomStringGenerator class
public class RandomStringGenerator
{
private readonly Random _random;
private readonly char[] _alphabet;
public RandomStringGenerator(string alphabet)
{
if (string.IsNullOrEmpty(alphabet))
throw new ArgumentException("alphabet");
_random = new Random();
_alphabet = alphabet.Distinct().ToArray();
}
public string NextString(int maxLength)
{
// Get a value randomly distributed between 0.0 and 1.0 --
// this is approximately what the System.Random class provides.
double value = _random.NextDouble();
// This is where the magic happens: we "translate" the above number
// to a length based on our computed probability distribution for the given
// alphabet and the desired maximum string length.
int length = GetLengthFromRandomValue(value, _alphabet.Length, maxLength);
// The rest is easy: allocate a char array of the length determined above...
char[] chars = new char[length];
// ...populate it with a bunch of random values from the alphabet...
for (int i = 0; i < length; ++i)
{
chars[i] = _alphabet[_random.Next(0, _alphabet.Length)];
}
// ...and return a newly constructed string.
return new string(chars);
}
static int GetLengthFromRandomValue(double value, int alphabetSize, int maxLength)
{
// Looping really might not be the smartest way to do this,
// but it's the most obvious way that immediately springs to my mind.
for (int length = 1; length <= maxLength; ++length)
{
Range r = GetRangeForLength(length, alphabetSize, maxLength);
if (r.Contains(value))
return length;
}
return maxLength;
}
static Range GetRangeForLength(int length, int alphabetSize, int maxLength)
{
int L = length;
int Z = alphabetSize;
int N = maxLength;
double possibleStrings = (1 - (Math.Pow(Z, N + 1)) / (1 - Z));
double stringsOfGivenLength = Math.Pow(Z, L);
double possibleSmallerStrings = (1 - Math.Pow(Z, L)) / (1 - Z);
double probabilityOfGivenLength = ((double)stringsOfGivenLength / possibleStrings);
double probabilityOfShorterLength = ((double)possibleSmallerStrings / possibleStrings);
double startPoint = probabilityOfShorterLength;
double endPoint = probabilityOfShorterLength + probabilityOfGivenLength;
return new Range(startPoint, endPoint);
}
}
Range struct
public struct Range
{
public readonly double StartPoint;
public readonly double EndPoint;
public Range(double startPoint, double endPoint)
: this()
{
this.StartPoint = startPoint;
this.EndPoint = endPoint;
}
public bool Contains(double value)
{
return this.StartPoint <= value && value <= this.EndPoint;
}
}
Test
static void Main(string[] args)
{
const int N = 5;
const string alphabet = "acegikmoqstvwy";
int Z = alphabet.Length;
var rand = new RandomStringGenerator(alphabet);
var strings = new List<string>();
for (int i = 0; i < 100000; ++i)
{
strings.Add(rand.NextString(N));
}
Console.WriteLine("First 10 results:");
for (int i = 0; i < 10; ++i)
{
Console.WriteLine(strings[i]);
}
// sanity check
double sumOfProbabilities = 0.0;
for (int i = 1; i <= N; ++i)
{
double probability = Math.Pow(Z, i) / ((1 - (Math.Pow(Z, N + 1))) / (1 - Z));
int numStrings = strings.Count(str => str.Length == i);
Console.WriteLine("# strings of length {0}: {1} (probability = {2:0.00%})", i, numStrings, probability);
sumOfProbabilities += probability;
}
Console.WriteLine("Probabilities sum to {0:0.00%}.", sumOfProbabilities);
Console.ReadLine();
}
Output:
First 10 results:
wmkyw
qqowc
ackai
tokmo
eeiyw
cakgg
vceec
qwqyq
aiomt
qkyav
# strings of length 1: 1 (probability = 0.00%)
# strings of length 2: 38 (probability = 0.03%)
# strings of length 3: 475 (probability = 0.47%)
# strings of length 4: 6633 (probability = 6.63%)
# strings of length 5: 92853 (probability = 92.86%)
Probabilities sum to 100.00%.

My idea regarding this is like:
you have 1-n length string.there 26 possible 1 length string,26*26 2 length string and so on.
you can find out the percentage of each length string of the total possible strings.for example percentage of single length string is like
((26/(TOTAL_POSSIBLE_STRINGS_OF_ALL_LENGTH))*100).
similarly you can find out the percentage of other length strings.
Mark them on a number line between 1 to 100.ie suppose percentage of single length string is 3 and double length string is 6 then number line single length string lies between 0-3 while double length string lies between 3-9 and so on.
Now take a random number between 1 to 100.find out the range in which this number lies.I mean suppose for examplethe number you have randomly chosen is 2.Now this number lies between 0-3 so go 1 length string or if the random number chosen is 7 then go for double length string.
In this fashion you can see that length of each string choosen will be proportional to the percentage of the total number of that length string contribute to the all possible strings.
Hope I am clear.
Disclaimer: I have not gone through above solution except one or two.So if it matches with some one solution it will be purely a chance.
Also,I will welcome all the advice and positive criticism and correct me if I am wrong.
Thanks and regard
Mawia

Matthieu: Your idea doesn't work because strings with blanks are still more likely to be generated. In your case, with n=4, you could have the string 'ab' generated as 'a' + 'b' + '' + '' or '' + 'a' + 'b' + '', or other combinations. Thus not all the strings have the same chance of appearing.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio