Compare two arrays of points [closed] - algorithm

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
Improve this question
I'm trying to find a way to find similarities in two arrays of different points. I drew circles around points that have similar patterns and I would like to do some kind of auto comparison in intervals of let's say 100 points and tell what coefficient of similarity is for that interval. As you can see it might not be perfectly aligned also so point-to-point comparison would not be a good solution also (I suppose). Patterns that are slightly misaligned could also mean that they are matching the pattern (but obviously with a smaller coefficient)
What similarity could mean (1 coefficient is a perfect match, 0 or less - is not a match at all):
Points 640 to 660 - Very similar (coefficient is ~0.8)
Points 670 to 690 - Quite similar (coefficient is ~0.5-~0.6)
Points 720 to 780 - Let's say quite similar (coefficient is ~0.5-~0.6)
Points 790 to 810 - Perfectly similar (coefficient is 1)
Coefficient is just my thoughts of how a final calculated result of comparing function could look like with given data.
I read many posts on SO but it didn't seem to solve my problem. I would appreciate your help a lot. Thank you
P.S. Perfect answer would be the one that provides pseudo code for function which could accept two data arrays as arguments (intervals of data) and return coefficient of similarity.
Click here to see original size of image

I also think High Performance Mark has basically given you the answer (cross-correlation). In my opinion, most of the other answers are only giving you half of what you need (i.e., dot product plus compare against some threshold). However, this won't consider a signal to be similar to a shifted version of itself. You'll want to compute this dot product N + M - 1 times, where N, M are the sizes of the arrays. For each iteration, compute the dot product between array 1 and a shifted version of array 2. The amount you shift array 2 increases by one each iteration. You can think of array 2 as a window you are passing over array 1. You'll want to start the loop with the last element of array 2 only overlapping the first element in array 1.
This loop will generate numbers for different amounts of shift, and what you do with that number is up to you. Maybe you compare it (or the absolute value of it) against a threshold that you define to consider two signals "similar".
Lastly, in many contexts, a signal is considered similar to a scaled (in the amplitude sense, not time-scaling) version of itself, so there must be a normalization step prior to computing the cross-correlation. This is usually done by scaling the elements of the array so that the dot product with itself equals 1. Just be careful to ensure this makes sense for your application numerically, i.e., integers don't scale very well to values between 0 and 1 :-)

i think HighPerformanceMarks's suggestion is the standard way of doing the job.
a computationally lightweight alternative measure might be a dot product.
split both arrays into the same predefined index intervals.
consider the array elements in each intervals as vector coordinates in high-dimensional space.
compute the dot product of both vectors.
the dot product will not be negative. if the two vectors are perpendicular in their vector space, the dot product will be 0 (in fact that's how 'perpendicular' is usually defined in higher dimensions), and it will attain its maximum for identical vectors.
if you accept the geometric notion of perpendicularity as a (dis)similarity measure, here you go.
caveat:
this is an ad hoc heuristic chosen for computational efficiency. i cannot tell you about mathematical/statistical properties of the process and separation properties - if you need rigorous analysis, however, you'll probably fare better with correlation theory anyway and should perhaps forward your question to math.stackexchange.com.

My Attempt:
Total_sum=0
1. For each index i in the range (m,n)
2. sum=0
3. k=Array1[i]*Array2[i]; t1=magnitude(Array1[i]); t2=magnitude(Array2[i]);
4. k=k/(t1*t2)
5. sum=sum+k
6. Total_sum=Total_sum+sum
Coefficient=Total_sum/(m-n)
If all values are equal, then sum would return 1 in each case and total_sum would return (m-n)*(1). Hence, when the same is divided by (m-n) we get the value as 1. If the graphs are exact opposites, we get -1 and for other variations a value between -1 and 1 is returned.
This is not so efficient when the y range or the x range is huge. But, I just wanted to give you an idea.
Another option would be to perform an extensive xnor.
1. For each index i in the range (m,n)
2. sum=1
3. k=Array1[i] xnor Array2[i];
4. k=k/((pow(2,number_of_bits))-1) //This will scale k down to a value between 0 and 1
5. sum=(sum+k)/2
Coefficient=sum
Is this helpful ?

You can define a distance metric for two vectors A and B of length N containing numbers in the interval [-1, 1] e.g. as
sum = 0
for i in 0 to 99:
d = (A[i] - B[i])^2 // this is in range 0 .. 4
sum = (sum / 4) / N // now in range 0 .. 1
This now returns distance 1 for vectors that are completely opposite (one is all 1, another all -1), and 0 for identical vectors.
You can translate this into your coefficient by
coeff = 1 - sum
However, this is a crude approach because it does not take into account the fact that there could be horizontal distortion or shift between the signals you want to compare, so let's look at some approaches for coping with that.
You can sort both your arrays (e.g. in ascending order) and then calculate the distance / coefficient. This returns more similarity than the original metric, and is agnostic towards permutations / shifts of the signal.
You can also calculate the differentials and calculate distance / coefficient for those, and then you can do that sorted also. Using differentials has the benefit that it eliminates vertical shifts. Sorted differentials eliminate horizontal shift but still recognize different shapes better than sorted original data points.
You can then e.g. average the different coefficients. Here more complete code. The routine below calculates coefficient for arrays A and B of given size, and takes d many differentials (recursively) first. If sorted is true, the final (differentiated) array is sorted.
procedure calc(A, B, size, d, sorted):
if (d > 0):
A' = new array[size - 1]
B' = new array[size - 1]
for i in 0 to size - 2:
A'[i] = (A[i + 1] - A[i]) / 2 // keep in range -1..1 by dividing by 2
B'[i] = (B[i + 1] - B[i]) / 2
return calc(A', B', size - 1, d - 1, sorted)
else:
if (sorted):
A = sort(A)
B = sort(B)
sum = 0
for i in 0 to size - 1:
sum = sum + (A[i] - B[i]) * (A[i] - B[i])
sum = (sum / 4) / size
return 1 - sum // return the coefficient
procedure similarity(A, B, size):
sum a = 0
a = a + calc(A, B, size, 0, false)
a = a + calc(A, B, size, 0, true)
a = a + calc(A, B, size, 1, false)
a = a + calc(A, B, size, 1, true)
return a / 4 // take average
For something completely different, you could also run Fourier transform using FFT and then take a distance metric on the returning spectra.

Related

Conditional sampling of binary vectors (?)

I'm trying to find a name for my problem, so I don't have to re-invent wheel when coding an algorithm which solves it...
I have say 2,000 binary (row) vectors and I need to pick 500 from them. In the picked sample I do column sums and I want my sample to be as close as possible to a pre-defined distribution of the column sums. I'll be working with 20 to 60 columns.
A tiny example:
Out of the vectors:
110
010
011
110
100
I need to pick 2 to get column sums 2, 1, 0. The solution (exact in this case) would be
110
100
My ideas so far
one could maybe call this a binary multidimensional knapsack, but I did not find any algos for that
Linear Programming could help, but I'd need some step by step explanation as I got no experience with it
as exact solution is not always feasible, something like simulated annealing brute force could work well
a hacky way using constraint solvers comes to mind - first set the constraints tight and gradually loosen them until some solution is found - given that CSP should be much faster than ILP...?
My concrete, practical (if the approximation guarantee works out for you) suggestion would be to apply the maximum entropy method (in Chapter 7 of Boyd and Vandenberghe's book Convex Optimization; you can probably find several implementations with your favorite search engine) to find the maximum entropy probability distribution on row indexes such that (1) no row index is more likely than 1/500 (2) the expected value of the row vector chosen is 1/500th of the predefined distribution. Given this distribution, choose each row independently with probability 500 times its distribution likelihood, which will give you 500 rows on average. If you need exactly 500, repeat until you get exactly 500 (shouldn't take too many tries due to concentration bounds).
Firstly I will make some assumptions regarding this problem:
Regardless whether the column sum of the selected solution is over or under the target, it weighs the same.
The sum of the first, second, and third column are equally weighted in the solution (i.e. If there's a solution whereas the first column sum is off by 1, and another where the third column sum is off by 1, the solution are equally good).
The closest problem I can think of this problem is the Subset sum problem, which itself can be thought of a special case of Knapsack problem.
However both of these problem are NP-Complete. This means there are no polynomial time algorithm that can solve them, even though it is easy to verify the solution.
If I were you the two most arguably efficient solution of this problem are linear programming and machine learning.
Depending on how many columns you are optimising in this problem, with linear programming you can control how much finely tuned you want the solution, in exchange of time. You should read up on this, because this is fairly simple and efficient.
With Machine learning, you need a lot of data sets (the set of vectors and the set of solutions). You don't even need to specify what you want, a lot of machine learning algorithms can generally deduce what you want them to optimise based on your data set.
Both solution has pros and cons, you should decide which one to use yourself based on the circumstances and problem set.
This definitely can be modeled as (integer!) linear program (many problems can). Once you have it, you can use a program such as lpsolve to solve it.
We model vector i is selected as x_i which can be 0 or 1.
Then for each column c, we have a constraint:
sum of all (x_i * value of i in column c) = target for column c
Taking your example, in lp_solve this could look like:
min: ;
+x1 +x4 +x5 >= 2;
+x1 +x4 +x5 <= 2;
+x1 +x2 +x3 +x4 <= 1;
+x1 +x2 +x3 +x4 >= 1;
+x3 <= 0;
+x3 >= 0;
bin x1, x2, x3, x4, x5;
If you are fine with a heuristic based search approach, here is one.
Go over the list and find the minimum squared sum of the digit wise difference between each bit string and the goal. For example, if we are looking for 2, 1, 0, and we are scoring 0, 1, 0, we would do it in the following way:
Take the digit wise difference:
2, 0, 1
Square the digit wise difference:
4, 0, 1
Sum:
5
As a side note, squaring the difference when scoring is a common method when doing heuristic search. In your case, it makes sense because bit strings that have a 1 in as the first digit are a lot more interesting to us. In your case this simple algorithm would pick first 110, then 100, which would is the best solution.
In any case, there are some optimizations that could be made to this, I will post them here if this kind of approach is what you are looking for, but this is the core of the algorithm.
You have a given target binary vector. You want to select M vectors out of N that have the closest sum to the target. Let's say you use the eucilidean distance to measure if a selection is better than another.
If you want an exact sum, have a look at the k-sum problem which is a generalization of the 3SUM problem. The problem is harder than the subset sum problem, because you want an exact number of elements to add to a target value. There is a solution in O(N^(M/2)). lg N), but that means more than 2000^250 * 7.6 > 10^826 operations in your case (in the favorable case where vectors operations have a cost of 1).
First conclusion: do not try to get an exact result unless your vectors have some characteristics that may reduce the complexity.
Here's a hill climbing approach:
sort the vectors by number of 1's: 111... first, 000... last;
use the polynomial time approximate algorithm for the subset sum;
you have an approximate solution with K elements. Because of the order of elements (the big ones come first), K should be a little as possible:
if K >= M, you take the M first vectors of the solution and that's probably near the best you can do.
if K < M, you can remove the first vector and try to replace it with 2 or more vectors from the rest of the N vectors, using the same technique, until you have M vectors. To sumarize: split the big vectors into smaller ones until you reach the correct number of vectors.
Here's a proof of concept with numbers, in Python:
import random
def distance(x, y):
return abs(x-y)
def show(ls):
if len(ls) < 10:
return str(ls)
else:
return ", ".join(map(str, ls[:5]+("...",)+ls[-5:]))
def find(is_xs, target):
# see https://en.wikipedia.org/wiki/Subset_sum_problem#Pseudo-polynomial_time_dynamic_programming_solution
S = [(0, ())] # we store indices along with values to get the path
for i, x in is_xs:
T = [(x + t, js + (i,)) for t, js in S]
U = sorted(S + T)
y, ks = U[0]
S = [(y, ks)]
for z, ls in U:
if z == target: # use the euclidean distance here if you want an approximation
return ls
if z != y and z < target:
y, ks = z, ls
S.append((z, ls))
ls = S[-1][1] # take the closest element to target
return ls
N = 2000
M = 500
target = 1000
xs = [random.randint(0, 10) for _ in range(N)]
print ("Take {} numbers out of {} to make a sum of {}", M, xs, target)
xs = sorted(xs, reverse = True)
is_xs = list(enumerate(xs))
print ("Sorted numbers: {}".format(show(tuple(is_xs))))
ls = find(is_xs, target)
print("FIRST TRY: {} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
splits = 0
while len(ls) < M:
first_x = xs[ls[0]]
js_ys = [(i, x) for i, x in is_xs if i not in ls and x != first_x]
replace = find(js_ys, first_x)
splits += 1
if len(replace) < 2 or len(replace) + len(ls) - 1 > M or sum(xs[i] for i in replace) != first_x:
print("Give up: can't replace {}.\nAdd the lowest elements.")
ls += tuple([i for i, x in is_xs if i not in ls][len(ls)-M:])
break
print ("Replace {} (={}) by {} (={})".format(ls[:1], first_x, replace, sum(xs[i] for i in replace)))
ls = tuple(sorted(ls[1:] + replace)) # use a heap?
print("{} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
print("AFTER {} splits, {} -> {}".format(splits, ls, sum(x for i, x in is_xs if i in ls)))
The result is obviously not guaranteed to be optimal.
Remarks:
Complexity: find has a polynomial time complexity (see the Wikipedia page) and is called at most M^2 times, hence the complexity remains polynomial. In practice, the process is reasonably fast (split calls have a small target).
Vectors: to ensure that you reach the target with the minimum of elements, you can improve the order of element. Your target is (t_1, ..., t_c): if you sort the t_js from max to min, you get the more importants columns first. You can sort the vectors: by number of 1s and then by the presence of a 1 in the most important columns. E.g. target = 4 8 6 => 1 1 1 > 0 1 1 > 1 1 0 > 1 0 1 > 0 1 0 > 0 0 1 > 1 0 0 > 0 0 0.
find (Vectors) if the current sum exceed the target in all the columns, then you're not connecting to the target (any vector you add to the current sum will bring you farther from the target): don't add the sum to S (z >= target case for numbers).
I propose a simple ad hoc algorithm, which, broadly speaking, is a kind of gradient descent algorithm. It seems to work relatively well for input vectors which have a distribution of 1s “similar” to the target sum vector, and probably also for all “nice” input vectors, as defined in a comment of yours. The solution is not exact, but the approximation seems good.
The distance between the sum vector of the output vectors and the target vector is taken to be Euclidean. To minimize it means minimizing the sum of the square differences off sum vector and target vector (the square root is not needed because it is monotonic). The algorithm does not guarantee to yield the sample that minimizes the distance from the target, but anyway makes a serious attempt at doing so, by always moving in some locally optimal direction.
The algorithm can be split into 3 parts.
First of all the first M candidate output vectors out of the N input vectors (e.g., N=2000, M=500) are put in a list, and the remaining vectors are put in another.
Then "approximately optimal" swaps between vectors in the two lists are done, until either the distance would not decrease any more, or a predefined maximum number of iterations is reached. An approximately optimal swap is one where removing the first vector from the list of output vectors causes a maximal decrease or minimal increase of the distance, and then, after the removal of the first vector, adding the second vector to the same list causes a maximal decrease of the distance. The whole swap is avoided if the net result is not a decrease of the distance.
Then, as a last phase, "optimal" swaps are done, again stopping on no decrease in distance or maximum number of iterations reached. Optimal swaps cause a maximal decrease of the distance, without requiring the removal of the first vector to be optimal in itself. To find an optimal swap all vector pairs have to be checked. This phase is much more expensive, being O(M(N-M)), while the previous "approximate" phase is O(M+(N-M))=O(N). Luckily, when entering this phase, most of the work has already been done by the previous phase.
from typing import List, Tuple
def get_sample(vects: List[Tuple[int]], target: Tuple[int], n_out: int,
max_approx_swaps: int = None, max_optimal_swaps: int = None,
verbose: bool = False) -> List[Tuple[int]]:
"""
Get a sample of the input vectors having a sum close to the target vector.
Closeness is measured in Euclidean metrics. The output is not guaranteed to be
optimal (minimum square distance from target), but a serious attempt is made.
The max_* parameters can be used to avoid too long execution times,
tune them to your needs by setting verbose to True, or leave them None (∞).
:param vects: the list of vectors (tuples) with the same number of "columns"
:param target: the target vector, with the same number of "columns"
:param n_out: the requested sample size
:param max_approx_swaps: the max number of approximately optimal vector swaps,
None means unlimited (default: None)
:param max_optimal_swaps: the max number of optimal vector swaps,
None means unlimited (default: None)
:param verbose: print some info if True (default: False)
:return: the sample of n_out vectors having a sum close to the target vector
"""
def square_distance(v1, v2):
return sum((e1 - e2) ** 2 for e1, e2 in zip(v1, v2))
n_vec = len(vects)
assert n_vec > 0
assert n_out > 0
n_rem = n_vec - n_out
assert n_rem > 0
output = vects[:n_out]
remain = vects[n_out:]
n_col = len(vects[0])
assert n_col == len(target) > 0
sumvect = (0,) * n_col
for outvect in output:
sumvect = tuple(map(int.__add__, sumvect, outvect))
sqdist = square_distance(sumvect, target)
if verbose:
print(f"sqdist = {sqdist:4} after"
f" picking the first {n_out} vectors out of {n_vec}")
if max_approx_swaps is None:
max_approx_swaps = sqdist
n_approx_swaps = 0
while sqdist and n_approx_swaps < max_approx_swaps:
# find the best vect to subtract (the square distance MAY increase)
sqdist_0 = None
index_0 = None
sumvect_0 = None
for index in range(n_out):
tmp_sumvect = tuple(map(int.__sub__, sumvect, output[index]))
tmp_sqdist = square_distance(tmp_sumvect, target)
if sqdist_0 is None or sqdist_0 > tmp_sqdist:
sqdist_0 = tmp_sqdist
index_0 = index
sumvect_0 = tmp_sumvect
# find the best vect to add,
# but only if there is a net decrease of the square distance
sqdist_1 = sqdist
index_1 = None
sumvect_1 = None
for index in range(n_rem):
tmp_sumvect = tuple(map(int.__add__, sumvect_0, remain[index]))
tmp_sqdist = square_distance(tmp_sumvect, target)
if sqdist_1 > tmp_sqdist:
sqdist_1 = tmp_sqdist
index_1 = index
sumvect_1 = tmp_sumvect
if sumvect_1:
tmp = output[index_0]
output[index_0] = remain[index_1]
remain[index_1] = tmp
sqdist = sqdist_1
sumvect = sumvect_1
n_approx_swaps += 1
else:
break
if verbose:
print(f"sqdist = {sqdist:4} after {n_approx_swaps}"
f" approximately optimal swap{'s'[n_approx_swaps == 1:]}")
diffvect = tuple(map(int.__sub__, sumvect, target))
if max_optimal_swaps is None:
max_optimal_swaps = sqdist
n_optimal_swaps = 0
while sqdist and n_optimal_swaps < max_optimal_swaps:
# find the best pair to swap,
# but only if the square distance decreases
best_sqdist = sqdist
best_diffvect = diffvect
best_pair = None
for i0 in range(M):
tmp_diffvect = tuple(map(int.__sub__, diffvect, output[i0]))
for i1 in range(n_rem):
new_diffvect = tuple(map(int.__add__, tmp_diffvect, remain[i1]))
new_sqdist = sum(d * d for d in new_diffvect)
if best_sqdist > new_sqdist:
best_sqdist = new_sqdist
best_diffvect = new_diffvect
best_pair = (i0, i1)
if best_pair:
tmp = output[best_pair[0]]
output[best_pair[0]] = remain[best_pair[1]]
remain[best_pair[1]] = tmp
sqdist = best_sqdist
diffvect = best_diffvect
n_optimal_swaps += 1
else:
break
if verbose:
print(f"sqdist = {sqdist:4} after {n_optimal_swaps}"
f" optimal swap{'s'[n_optimal_swaps == 1:]}")
return output
from random import randrange
C = 30 # number of columns
N = 2000 # total number of vectors
M = 500 # number of output vectors
F = 0.9 # fill factor of the target sum vector
T = int(M * F) # maximum value + 1 that can be appear in the target sum vector
A = 10000 # maximum number of approximately optimal swaps, may be None (∞)
B = 10 # maximum number of optimal swaps, may be None (unlimited)
target = tuple(randrange(T) for _ in range(C))
vects = [tuple(int(randrange(M) < t) for t in target) for _ in range(N)]
sample = get_sample(vects, target, M, A, B, True)
Typical output:
sqdist = 2639 after picking the first 500 vectors out of 2000
sqdist = 9 after 27 approximately optimal swaps
sqdist = 1 after 4 optimal swaps
P.S.: As it stands, this algorithm is not limited to binary input vectors, integer vectors would work too. Intuitively I suspect that the quality of the optimization could suffer, though. I suspect that this algorithm is more appropriate for binary vectors.
P.P.S.: Execution times with your kind of data are probably acceptable with standard CPython, but get better (like a couple of seconds, almost a factor of 10) with PyPy. To handle bigger sets of data, the algorithm would have to be translated to C or some other language, which should not be difficult at all.

Tiling a Triangular Grid

Design and explain a recursive divide-and-conquer algorithm. Anyone has ideas?
Given an isosceles right triangular grid for some k ≥ 2 as shown in Figure 1(b), this problem asks you to completely cover it using the tiles given in Figure 1(a). The bottom-left corner of the grid must not be covered. No two tiles can overlap and all tiles must remain completely inside the given triangular grid. You must use all four types of tiles shown in Figure 1(a), and no tile type can be used to cover more than 40% of the total grid area. You are allowed to rotate the tiles, as needed, before putting them on the grid.
This is the idea of induction indeed, and is similar to the famous example "L-Tile" covering
As you said, you have solved the problem for k = 2, it's a good and correct starting point to solving small example first, yet I think this problem there is a bit tricky for k = 2 case, mainly due to the each type cannot exceed 40% constrain.
Then for k>2, say k = 3 in your example, we try to make use of what you have solved, i.e. the case k = 2
With very simple observation, one may notice that for k = n, it can actually be made up of 4 k=n-1 cases (see image below)
Now the shaded part in the middle form a hole that can filled by 1 type B, so we can first filled the 4 small n-1 case and fill the hole with type B...
But then this construction face a problem: type B will exceed 40% of the area!
Consider k = 2, no matter how you fill the area, 2 type B must be used, I do not have a strong proof but by some brute force trail & error you should be convinced. Then for k = 3, we have 4 small triangles meaning we have 2*4 = 8 Type B, plus 1 more to fill the hole will gives us 9 Type B, each uses 1.5 sq units, which total uses up 13.5 sq units.
As k = 3, the total area is (2^3)^2 / 2 = 32 sq units
13.5/32 = 0.42.... which violate the constrain!
So what to do? Here is the reason why we have to use a trick to handle the k = 2 case (I assume you have go through this part as you said you know how to do k = 2 case)
First, we know that using our constructive method to build a large triangle from 4 smaller triangles, only Type B will violate this constrain (i.e. the 40% area), you can verify yourself. So we want to reduce the total number of Type B used, yet each smaller triangle must use at least 2 Type B, so the only place we may reduce is the empty hole in the middle of the large triangle, can we use other Type instead of Type B? At the same time, we want the other parts of the small triangle remain unchanged so that we can use same argument to do an induction (i.e. in general speaking, form 2^n triangle from 4 2^(n-1) triangles using same construction method)
The answer is YES if we special design the k = 2 case
See my construction below: (There maybe other construction works too, but I only need to know one)
The trick is I intentionally move 1 Type B next to the empty(gray) triangle
Let's stop right here for a bit, and do some verification:
To construct a k = 2 case, we use
2 Type A = 2 sq.units < 40%
2 Type B = 3 sq.units < 40%
1 Type C = 1.5 sq.units < 40%
1 Type D = 1 sq.unit < 40%
Total use 7.5 sq.units, good
Now imagine we use exactly the same method to construct those 4 triangles to make a large one, the middle one still be an empty hole with shape of Type B, but now instead of filling it with 1 Type B, we fill the hole TOGETHER WITH the 3 Type B just next to them (look back the k = 2 case), using Type A & D
(I use same color scheme as above for easy understanding), we do this for all 3 small triangles which made up the hole in the middle.
Here is the last part (I know it's long...)
We have reduce the number of Type B used when constructing a large triangle from smaller ones, but at the same time we increase the number of Type A & D used! So is this new construction method valid?
First notice that it does not change any parts of the small triangles except the Type B next to the gray triangle, i.e. If the 40% constrain is fulfilled, this method is inductive and recursive to fill a 2^n side triangle
Then let's count again the number of each Type we used.
For k = 3, total units is 32, we uses:
2*4+3 = 11 Type A = 11 sq.units < 40%
2*4-3 = 5 Type B = 7.5 sq.units < 40%
1*4 = 4 Type C = 6 sq.units < 40%
1*4+3 = 7 Type D = 7 sq.unit < 40%
Total we cover 31.5 units, good, now let's proof the 40% constrain is satisfied for k = n > 3
Let FA(n-1) be the total area of Type A used to fill 2^n-1 triangles using our new method, likewise, FB(n-1), FC(n-1), FD(n-1) with similar definitions
Assume F*(n-1) is true, i.e. not exceeding 40% of total area, we proof that F*(n) is true.
We got
FA(n) = FA(n-1)*4 + 3*1
FB(n) = FB(n-1)*4 - 3*1.5
FC(n) = FC(n-1)*4
FD(n) = FD(n-1)*4 + 3*1
We only show the proof for FD(n), other three should be proofed with similar method (M.I.)
Using method of substitution, FD(n) = 2*(4^(n-2)) - 1 for n>=3 (You should at least try to come up with this equation yourself)
We want to show FD(n)/(2^2(n)/2) < 0.4
i.e. 2FD(n)/4^n < 0.4
Consider LHS,
LHS = (4*(4^(n-2)) - 1)/4^n
< 4^(n-1)/4^n = 1/4 < 0.4 Q.E.D
That means using this method, all Type A-D will not exceed 40% of total area for any 2^k sided triangle, for k >= 3, finally we show that inductively, there is a method satisfy all constrains to construct such a triangle.
TL;DR
The hard part is to satisfy the 40% area constrain
Use a special construction on k = 2 case first, try to use it to build k = 3 case (then k = 4, k = 5...idea of induction!)
When using k=n-1 case to build k=n case, write down the formula of total area consumed by each type, and show that they would not exceed 40% of total areas
Combined point 2 & 3, it's an induction method to show that for any k >= 2, there is a method (which we described) to fill the 2^k sided triangle without breaking any constrains

Non-linear comparison sorting / scoring

I have an array I want to sort based on assigning a score to each element in the array.
Let's say the possible score range is 0-100. And to get that score we are going to use 2 comparison data points, one with a weighting of 75 and one with a weighting of 25. Let's call them valueA and valueB. And we will transpose each value into a score. So:
valueA (range = 0-10,000)
valueB (range = 0-70)
scoreA (range = 0 - 75)
scoreB (range = 0 - 25)
scoreTotal = scoreA + scoreB (0 - 100)
Now the question is how to transpose valueA to scoreA in a non-linear way with heavier weighting for being close to the min value. What I mean by that is that for valueA, 0 would be a perfect score (75), but a value of say 20 would give a mid-point score of 37.5 and a value of say 100 would give a very low score of say 5, and then everything greater would trend towards 0 (e.g. a value of 5,000 would be essentially 0). Ideally I could setup a curve with a few data points (say 4 quartile points) and then the algorithm would fit to that curve. Or maybe the simplest solution is to create a bunch of points on the curve (say 10) and do a linear transposition between each of those 10 points? But I'm hoping there is a much simpler algorithm to accomplish this without figuring out all the points on the curve myself and then having to tweak 10+ variables. I'd rather 1 or 2 inputs to define how steep the curve is. Possible?
I don't need something super complex or accurate, just a simple algorithm so there is greater weighting for being close to the min of the range, and way less weighting for being close to the max of the range. Hopefully this makes sense.
My stats math is so rusty I'm not even sure what this is called for searching for a solution. All those years of calculus and statistics for naught.
I'm implementing this in Objective C, but any c-ish/java-ish pseudo code would be fine.
A function you may want to try is
max / [(log(x+2)/log(2))^N]
where max is either 75 or 25 in your case. The log(x+2)/log(2) part ensures that f(0) == max (you can substitute log(x+C)/log(C) here for any C > 0; a higher C will slow the curve's descent); the ^N determines how quickly your function drops to 0 (you can play around with the function here to get a picture of what's going on)

matlab: optimum amount of points for linear fit

I want to make a linear fit to few data points, as shown on the image. Since I know the intercept (in this case say 0.05), I want to fit only points which are in the linear region with this particular intercept. In this case it will be lets say points 5:22 (but not 22:30).
I'm looking for the simple algorithm to determine this optimal amount of points, based on... hmm, that's the question... R^2? Any Ideas how to do it?
I was thinking about probing R^2 for fits using points 1 to 2:30, 2 to 3:30, and so on, but I don't really know how to enclose it into clear and simple function. For fits with fixed intercept I'm using polyfit0 (http://www.mathworks.com/matlabcentral/fileexchange/272-polyfit0-m) . Thanks for any suggestions!
EDIT:
sample data:
intercept = 0.043;
x = 0.01:0.01:0.3;
y = [0.0530642513911393,0.0600786706929529,0.0673485248329648,0.0794662409166333,0.0895915873196170,0.103837395346484,0.107224784565365,0.120300492775786,0.126318699218730,0.141508831492330,0.147135757370947,0.161734674733680,0.170982455701681,0.191799936622712,0.192312642057298,0.204771365716483,0.222689541632988,0.242582251060963,0.252582727297656,0.267390860166283,0.282890010610515,0.292381165948577,0.307990544720676,0.314264952297699,0.332344368808024,0.355781519885611,0.373277721489254,0.387722683944356,0.413648156978284,0.446500064130389;];
What you have here is a rather difficult problem to find a general solution of.
One approach would be to compute all the slopes/intersects between all consecutive pairs of points, and then do cluster analysis on the intersepts:
slopes = diff(y)./diff(x);
intersepts = y(1:end-1) - slopes.*x(1:end-1);
idx = kmeans(intersepts, 3);
x([idx; 3] == 2) % the points with the intersepts closest to the linear one.
This requires the statistics toolbox (for kmeans). This is the best of all methods I tried, although the range of points found this way might have a few small holes in it; e.g., when the slopes of two points in the start and end range lie close to the slope of the line, these points will be detected as belonging to the line. This (and other factors) will require a bit more post-processing of the solution found this way.
Another approach (which I failed to construct successfully) is to do a linear fit in a loop, each time increasing the range of points from some point in the middle towards both of the endpoints, and see if the sum of the squared error remains small. This I gave up very quickly, because defining what "small" is is very subjective and must be done in some heuristic way.
I tried a more systematic and robust approach of the above:
function test
%% example data
slope = 2;
intercept = 1.5;
x = linspace(0.1, 5, 100).';
y = slope*x + intercept;
y(1:12) = log(x(1:12)) + y(12)-log(x(12));
y(74:100) = y(74:100) + (x(74:100)-x(74)).^8;
y = y + 0.2*randn(size(y));
%% simple algorithm
[X,fn] = fminsearch(#(ii)P(ii, x,y,intercept), [0.5 0.5])
[~,inds] = P(X, y,x,intercept)
end
function [C, inds] = P(ii, x,y,intercept)
% ii represents fraction of range from center to end,
% So ii lies between 0 and 1.
N = numel(x);
n = round(N/2);
ii = round(ii*n);
inds = min(max(1, n+(-ii(1):ii(2))), N);
% Solve linear system with fixed intercept
A = x(inds);
b = y(inds) - intercept;
% and return the sum of squared errors, divided by
% the number of points included in the set. This
% last step is required to prevent fminsearch from
% reducing the set to 1 point (= minimum possible
% squared error).
C = sum(((A\b)*A - b).^2)/numel(inds);
end
which only finds a rough approximation to the desired indices (12 and 74 in this example).
When fminsearch is run a few dozen times with random starting values (really just rand(1,2)), it gets more reliable, but I still wouln't bet my life on it.
If you have the statistics toolbox, use the kmeans option.
Depending on the number of data values, I would split the data into a relative small number of overlapping segments, and for each segment calculate the linear fit, or rather the 1-st order coefficient, (remember you know the intercept, which will be same for all segments).
Then, for each coefficient calculate the MSE between this hypothetical line and entire dataset, choosing the coefficient which yields the smallest MSE.

Algorithm to identify a unique free polyomino (or polyomino hash)

In short: How to hash a free polyomino?
This could be generalized into: How to efficiently hash an arbitrary collection of 2D integer coordinates, where a set contains unique pairs of non-negative integers, and a set is considered unique if and only if no translation, rotation, or flip can map it identically to another set?
For impatient readers, please note I'm fully aware of a brute force approach. I'm looking for a better way -- or a very convincing proof that no other way can exist.
I'm working on some different algorithms to generate random polyominos. I want to test their output to determine how random they are -- i.e. are certain instances of a given order generated more frequently than others. Visually, it is very easy to identify different orientations of a free polyomino, for example the following Wikipedia illustration shows all 8 orientations of the "F" pentomino (Source):
How would one put a number on this polyomino - that is, hash a free polyomino? I don't want to depend on a prepolulated list of "named" polyominos. Broadly agreed-upon names only exists for orders 4 and 5, anyway.
This is not necessarily equavalent to enumerating all free (or one-sided, or fixed) polyominos of a given order. I only want to count the number of times a given configuration appears. If a generating algorithm never produces a certain polyomino it will simply not be counted.
The basic logic of the counting is:
testcount = 10000 // Arbitrary
order = 6 // Create hexominos in this test
hashcounts = new hashtable
for i = 1 to testcount
poly = GenerateRandomPolyomino(order)
hash = PolyHash(poly)
if hashcounts.contains(hash) then
hashcounts[hash]++
else
hashcounts[hash] = 1
What I'm looking for is an efficient PolyHash algorithm. The input polyominos are simply defined as a set of coordinates. One orientation of the T tetronimo could be, for example:
[[1,0], [0,1], [1,1], [2,1]]:
|012
-+---
0| X
1|XXX
You can assume that that input polyomino will already be normalized to be aligned against the X and Y axes and have only positive coordinates. Formally, each set:
Will have at least 1 coordinate where the x value is 0
Will have at least 1 coordinate where the y value is 0
Will not have any coordinates where x < 0 or y < 0
I'm really looking for novel algorithms that avoid the increasing number of integer operations required by a general brute force approach, described below.
Brute force
A brute force solution suggested here and here consists of hashing each set as an unsigned integer using each coordinate as a binary flag, and taking the minimum hash of all possible rotations (and in my case flips), where each rotation / flip must also be translated to the origin. This results in a total of 23 set operations for each input set to get the "free" hash:
Rotate (6x)
Flip (1x)
Translate (7x)
Hash (8x)
Find minimum of computed hashes (1x)
Where the sequence of operations to obtain each hash is:
Hash
Rotate, Translate, Hash
Rotate, Translate, Hash
Rotate, Translate, Hash
Flip, Translate, Hash
Rotate, Translate, Hash
Rotate, Translate, Hash
Rotate, Translate, Hash
Well, I came up with a completely different approach. (Also thanks to corsiKa for some helpful insights!) Rather than hashing / encoding the squares, encode the path around them. The path consists of a sequence of 'turns' (including no turn) to perform before drawing each unit segment. I think an algorithm for getting the path from the coordinates of the squares is outside the scope of this question.
This does something very important: it destroys all location and orientation information, which we don't need. It is also very easy to get the path of the flipped object: you do so by simply reversing the order of the elements. Storage is compact because each element requires only 2 bits.
It does introduce one additional constraint: the polyomino must not have fully enclosed holes. (Formally, it must be simply connected.) Most discussions of polyominos consider a hole to exist even if it is sealed only by two touching corners, as this prevents tiling with any other non-trivial polyomino. Tracing the edges is not hindered by touching corners (as in the single heptomino with a hole), but it cannot leap from one outer loop to an inner one as in the complete ring-shaped octomino:
It also produces one additional challenge: finding the minumum ordering of the encoded path loop. This is because any rotation of the path (in the sense of string rotation) is a valid encoding. To always get the same encoding we have to find the minimal (or maximal) rotation of the path instructions. Thankfully this problem has already been solved: see for example http://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation.
Example:
If we arbitrarily assign the following values to the move operations:
No turn: 1
Turn right: 2
Turn left: 3
Here is the F pentomino traced clockwise:
An arbitrary initial encoding for the F pentomino is (starting at the bottom right corner):
2,2,3,1,2,2,3,2,2,3,2,1
The resulting minimum rotation of the encoding is
1,2,2,3,1,2,2,3,2,2,3,2
With 12 elements, this loop can be packed into 24 bits if two bits are used per instruction or only 19 bits if instructions are encoded as powers of three. Even with the 2-bit element encoding can easily fit that in a single unsigned 32 bit integer 0x6B6BAE:
1- 2- 2- 3- 1- 2- 2- 3- 2- 2- 3- 2
= 01-10-10-11-01-10-10-11-10-10-11-10
= 00000000011010110110101110101110
= 0x006B6BAE
The base-3 encoding with the start of the loop in the most significant powers of 3 is 0x5795F:
1*3^11 + 2*3^10 + 2*3^9 + 3*3^8 + 1*3^7 + 2*3^6
+ 2*3^5 + 3*3^4 + 2*3^3 + 2*3^2 + 3*3^1 + 2*3^0
= 0x0005795F
The maximum number of vertexes in the path around a polyomino of order n is 2n + 2. For 2-bit encoding the number of bits is twice the number of moves, so the maximum bits needed is 4n + 4. For base-3 encoding it's:
Where the "gallows" is the ceiling function. Accordingly any polyomino up to order 9 can be encoded in a single 32 bit integer. Knowing this you can choose your platform-specific data structure accordingly for the fastest hash comparison given the maximum order of the polyominos you'll be hashing.
You can reduce it down to 8 hash operations without the need to flip, rotate, or re-translate.
Note that this algorithm assumes you are operating with coordinates relative to itself. That is to say it's not in the wild.
Instead of applying operations that flip, rotate, and translate, instead simply change the order in which you hash.
For instance, let us take the F pent above. In the simple example, let us presume the hash operation was something like this:
int hashPolySingle(Poly p)
int hash = 0
for x = 0 to p.width
fory = 0 to p.height
hash = hash * 31 + p.contains(x,y) ? 1 : 0
hashPolySingle = hash
int hashPoly(Poly p)
int hash = hashPolySingle(p)
p.rotateClockwise() // assume it translates inside
hash = hash * 31 + hashPolySingle(p)
// keep rotating for all 4 oritentations
p.flip()
// hash those 4
Instead of applying the function to all 8 different orientations of the poly, I would apply 8 different hash functions to 1 poly.
int hashPolySingle(Poly p, bool flip, int corner)
int hash = 0
int xstart, xstop, ystart, ystop
bool yfirst
switch(corner)
case 1: xstart = 0
xstop = p.width
ystart = 0
ystop = p.height
yfirst = false
break
case 2: xstart = p.width
xstop = 0
ystart = 0
ystop = p.height
yfirst = true
break
case 3: xstart = p.width
xstop = 0
ystart = p.height
ystop = 0
yfirst = false
break
case 4: xstart = 0
xstop = p.width
ystart = p.height
ystop = 0
yfirst = true
break
default: error()
if(flip) swap(xstart, xstop)
if(flip) swap(ystart, ystop)
if(yfirst)
for y = ystart to ystop
for x = xstart to xstop
hash = hash * 31 + p.contains(x,y) ? 1 : 0
else
for x = xstart to xstop
for y = ystart to ystop
hash = hash * 31 + p.contains(x,y) ? 1 : 0
hashPolySingle = hash
Which is then called in the 8 different ways. You could also encapsulate hashPolySingle in for loop around the corner, and around the flip or not. All the same.
int hashPoly(Poly p)
// approach from each of the 4 corners
int hash = hashPolySingle(p, false, 1)
hash = hash * 31 + hashPolySingle(p, false, 2)
hash = hash * 31 + hashPolySingle(p, false, 3)
hash = hash * 31 + hashPolySingle(p, false, 4)
// flip it
hash = hash * 31 + hashPolySingle(p, true, 1)
hash = hash * 31 + hashPolySingle(p, true, 2)
hash = hash * 31 + hashPolySingle(p, true, 3)
hash = hash * 31 + hashPolySingle(p, true, 4)
hashPoly = hash
In this way, you're implicitly rotating the poly from each direction, but you're not actually performing the rotation and translation. It performs the 8 hashes, which seem to be entirely necessary in order to accurately hash all 8 orientations, but wastes no passes over the poly that are not actually doing hashes. This seems to me to be the most elegant solution.
Note that there may be a better hashPolySingle() algorithm to use. Mine uses a Cartesian exhaustion algorithm that is on the order of O(n^2). Its worst case scenario is an L shape, which would cause there to be an N/2 * (N-1)/2 sized square for only N elements, or an efficiency of 1:(N-1)/4, compared to an I shape which would be 1:1. It may also be that the inherent invariant imposed by the architecture would actually make it less efficient than the naive algorithm.
My suspicion is that the above concern can be alleviated by simulating the Cartesian exhaustion by converting the set of nodes into an bi-directional graph that can be traversed, causing the nodes to be hit in the same order as my much more naive hashing algorithm, ignoring the empty spaces. This will bring the algorithm down to O(n) as the graph should be able to be constructed in O(n) time. Because I haven't done this, I can't say for sure, which is why I say it's only a suspicion, but there should be a way to do it.
Here's my DFS (depth first search) explained:
Start with the top-most cell (left-most as a tiebreaker). Mark it as visited. Every time you visit a cell, check all four directions for unvisited neighbors. Always check the four directions in this order: up, left, down, right.
Example
In this example, up and left fail, but down succeeds. So far our output is 001, and we recursively search the "down" cell.
We mark our new current cell as visited (and we'll finish searching the original cell when we finish searching this cell). Here, up=0, left=1.
We search the left-most cell and there are no unvisted neighbors (up=0, left=0, down=0, right=0). Our total output so far is 001010000.
We continue our search of the second cell. down=0, right=1. We search the cell to the right.
up=0, left=0, down=1. Search the down cell: all 0s. Total output so far is 001010000010010000. Then, we return from the down cell...
right=0, return. return. (Now, we are at the starting cell.) right=0. Done!
So, the total output is 20 (N*4) bits: 00101000001001000000.
Encoding improvement
But, we can save some bits.
The last visited cell will always encode 0000 for its four directions. So, don't encode the last visited cell to save 4 bits.
Another improvement: if you reached a cell by moving left, don't check that cells right-side. So, we only need 3 bits per cell, except 4 bits for the first cell, and 0 for the last cell.
The first cell will never have an up, or left neighbor, so omit these bits. So the first cell takes 2 bits.
So, with these improvements, we use only N*3-4 bits (e.g. 5 cells -> 11 bits; 9 cells -> 23 bits).
If you really want, you can compact a little more by noting that exactly N-1 bits will be "1".
Caveat
Yes, you'll need to encode all 8 rotations/flips of the polyomino and choose the least to get a canonical encoding.
I suspect this will still be faster than the outline approach. Also, holes in the polyomino shouldn't be a problem.
I worked on the same problem recently. I solved the problem fairly simply by
(1) generate a unique ID for a polyomino, such that each identical poly would have the same UID. For example, find the bounding box, normalize the corner of the bounding box, and collect the set of non-empty cells.
(2) generate all possible permutations by rotating (and flipping, if appropriate) a polyomino, and look for duplicates.
The advantage of this brute approach, other than it's simplicity, is that it still works if the
polys are distinguishable in some other way, for example if some of them are colored or numbered.
You can set up something like a trie to uniquely identify (and not just hash) your polyomino. Take your normalized polyomino and set up a binary search tree, where the root branches on whether (0,0) is has a set pixel, the next level branches on whether (0,1) has a set pixel, and so on. When you look up a polyomino, simply normalize it and then walk the tree. If you find it in the trie, then you're done. If not, assign that polyomino a unique id (just increment a counter), generate all 8 possible rotations and flips, then add those 8 to the trie.
On a trie miss, you'll have to generate all the rotations and reflections. But on a trie hit it should cost less (O(k^2) for k-polyominos).
To make lookups even more efficient, you could use a couple bits at a time and use a wider tree instead of a binary tree.
A valid hash function, if you're really afraid of hash collisions, is to make a hash function x + order * y for coordinates and then loop trough all the coordinates of a piece, adding (order ^ i) * hash(coord[i]) to the piece hash. That way, you can guarantee you won't get any hash collisions.

Resources