Querying a large number of multidimensional points in R^N - algorithm

I'm looking at listing/counting the integer points in R^N (in the sense of Euclidean space) that lie within certain geometric shapes, such as circles and ellipses, subject to various conditions, for small N. By this I mean that N <= 5, and the conditions are polynomial inequalities.
As a concrete example, take R^2. One of the queries I might like to run is "How many integer points are there in an ellipse (parameterised by x = 4 cos(theta), y = 3 sin(theta) ), such that y * x^2 - x * y = 4?"
I could implement this in Haskell like this:
ghci> let latticePoints = [(x,y) | x <- [-4..4], y <-[-3..3], 9*x^2 + 16*y^2 <= 144, y*x^2 - x*y == 4]
and then I would have:
ghci> latticePoints
[(-1,2),(2,2)]
Which indeed answers my question.
Of course, this is a very naive implementation, but it demonstrates what I'm trying to achieve. (I'm also only using Haskell here as I feel it most directly expresses the underlying mathematical ideas.)
Now, if I had something like "In R^5, how many integer points are there in a 4-sphere of radius 1000 (i.e. x^2 + y^2 + z^2 + w^2 + v^2 <= 1000000), satisfying x^3 - y + z = 20?", I might try something like this:
ghci> :{
Prelude| let latticePoints2 = [(x,y,z,w,v) | x <-[-1000..1000], y <- [-1000..1000],
Prelude| z <- [-1000..1000], w <- [-1000..1000], v <- [-1000..1000],
Prelude| x^2 + y^2 + z^2 + w^2 + v^2 <= 1000000, x^3 - y + z == 20]
Prelude| :}
so if I now type:
ghci> latticePoints2
Not much will happen...
I imagine the issue is that it's effectively looping through roughly 2000^5 (32 quadrillion!) points, and it's clearly unreasonable of me to expect my computer to deal with that. I can't imagine a similar implementation in Python or C would help matters much either.
So if I want to tackle a large number of points in such a way, what would be my best bet in terms of general algorithms or data structures? I saw in another thread (Count number of points inside a circle fast), someone mention quadtrees as well as K-D trees, but I wouldn't know how to implement those, nor how to appropriately query one once it was implemented.
I'm aware some of these numbers are quite large, but the biggest circles, ellipses, etc I'd be dealing with are of radius 10^12 (one trillion), and I certainly wouldn't need to deal with R^N with N > 5. If the above is NOT possible, I'd be interested to know what sort of numbers WOULD be feasible?

There is no general way to solve this problem. The problem of finding integer solutions to algebraic equations (equations of this sort are called Diophantine equations) is known to be undecidable. Apparently, you can write equations of this sort such that solving the equations ends up being equivalent to deciding whether a given Turing machine will halt on a given input.
In the examples you've listed, you've always constrained the points to be on some well-behaved shape, like an ellipse or a sphere. While this particular class of problem is definitely decidable, I'm skeptical that you can efficiently solve these problems for more complex curves. I suspect that it would be possible to construct short formulas that describe curves that are mostly empty but have a huge bounding box.
If you happen to know more about the structure of the problems you're trying to solve - for example, if you're always dealing with spheres or ellipses - then you may be able to find fast algorithms for this problem. In general, though, I don't think you'll be able to do much better than brute force. I'm willing to admit that (and in fact, hopeful that) someone will prove me wrong about this, though.

The idea behind the kd-tree method is that you recursively subdivide the search box and try to rule out whole boxes at a time. Given the current box, use some method that either (a) declares that all points in the box match the predicate, (b) declares that no points in the box match the predicate, or (c) makes no declaration (one possibility, which may be particularly convenient in Haskell: interval arithmetic). On (c), cut the box in half (say along the longest dimension) and recursively count in the halves. Obviously the method can choose (c) all the time, which devolves to brute force; the goal here is to do (a) or (b) as much as possible.
The performance of this method is very dependent on how it's instantiated. Try it -- it shouldn't be more than a couple dozen lines of code.
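For instance, a minimal Haskell sketch of this idea might look like the following (the names countBox, volume and discClassify are mine, and the interval bounds used are the crudest possible ones). A box is a list of inclusive (lo, hi) ranges, one per dimension, and the exact point test is used only once a box has shrunk to a single point:

type Box = [(Integer, Integer)]   -- one inclusive (lo, hi) range per dimension

volume :: Box -> Integer
volume = product . map (\(lo, hi) -> hi - lo + 1)

-- inside: exact test for a single point; classify: interval test returning
-- Just True (every point in the box matches), Just False (no point matches)
-- or Nothing (undecided, so the box is split along its widest dimension).
countBox :: ([Integer] -> Bool) -> (Box -> Maybe Bool) -> Box -> Integer
countBox inside classify box =
  case classify box of
    Just True  -> volume box
    Just False -> 0
    Nothing
      | volume box == 1 -> if inside (map fst box) then 1 else 0
      | otherwise       -> countBox inside classify left + countBox inside classify right
  where
    widths = map (\(lo, hi) -> hi - lo) box
    i = snd (maximum (zip widths [0 ..]))      -- index of the widest dimension
    (pre, (lo, hi) : post) = splitAt i box
    mid = (lo + hi) `div` 2
    left  = pre ++ [(lo, mid)] ++ post
    right = pre ++ [(mid + 1, hi)] ++ post

-- Example classifier: a ball of radius r, using per-dimension interval
-- bounds on the squared coordinates.
discClassify :: Integer -> Box -> Maybe Bool
discClassify r box
  | hiSum <= r * r = Just True
  | loSum >  r * r = Just False
  | otherwise      = Nothing
  where
    squareBounds (lo, hi)
      | lo <= 0 && 0 <= hi = (0, max (lo * lo) (hi * hi))
      | otherwise          = (min (lo * lo) (hi * hi), max (lo * lo) (hi * hi))
    loSum = sum (map (fst . squareBounds) box)
    hiSum = sum (map (snd . squareBounds) box)

-- ghci> countBox (\p -> sum (map (^ 2) p) <= 25) (discClassify 5) [(-5, 5), (-5, 5)]
-- 81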

For a nicely connected region, assuming your shape is significantly smaller than the containing search space, and given a seed point, you could do a growth/building algorithm:
Given a seed point:
    push seed point onto test-queue
    while test-queue has items:
        pop item from test-queue
        if item tests to be within the region (e.g. using a callback function):
            add item to inside-set
            for each neighbour point (generated on the fly):
                if neighbour is not in outside-set and not in inside-set:
                    add neighbour to test-queue
        else:
            add item to outside-set
    return inside-set
The trick is to find an initial seed point that is inside the region (a Haskell sketch of this procedure follows below).
Make sure your set implementation gives O(1) duplicate checking. This method will eventually break down with large numbers of dimensions as the surface area exceeds the volume, but for 5 dimensions should be fine.
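A direct Haskell translation of the pseudocode above might look like this (a sketch only; it uses Data.Set, which gives O(log n) rather than O(1) duplicate checks, and classifies points when they are popped rather than when they are queued). The ghci line counts the lattice points of the ellipse from the first question by flooding outward from the origin:

import qualified Data.Set as Set

-- inside: the membership test; neighbours: generates adjacent lattice points
-- on the fly; seed: a point assumed to be inside the region.
floodFill :: Ord p => (p -> Bool) -> (p -> [p]) -> p -> Set.Set p
floodFill inside neighbours seed = go [seed] Set.empty Set.empty
  where
    go [] insideSet _ = insideSet
    go (p : queue) insideSet outsideSet
      | Set.member p insideSet || Set.member p outsideSet =
          go queue insideSet outsideSet                    -- already classified
      | inside p  = go (neighbours p ++ queue) (Set.insert p insideSet) outsideSet
      | otherwise = go queue insideSet (Set.insert p outsideSet)

-- ghci> Set.size (floodFill (\(x, y) -> 9*x^2 + 16*y^2 <= (144 :: Int))
--                           (\(x, y) -> [(x+1, y), (x-1, y), (x, y+1), (x, y-1)])
--                           (0, 0))
-- 35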

Related

what is the right algorithm for ... eh, I don't know what it's called

Suppose I have a long and irregular digital signal made up of smaller but irregular signals occurring at various times (and overlapping each other). We will call these shorter signals the "pieces" that make up the larger signal. By "irregular" I mean that it is not a specific frequency or pattern.
Given the long signal I need to find the optimal arrangement of pieces that produce (as closely as possible) the larger signal. I know what the pieces look like but I don't know how many of them exist in the full signal (or how many times any one piece exists in the full signal). What software algorithm would you use to do this optimization? What do I search for on the web to get help on solving this problem?
Here's a stab at it.
This is actually the easier of the deconvolution problems. It is easier in that you may be able to have a unique answer. The harder problem is that you also don't know what the pieces look like. That case is called blind deconvolution. It is a harder problem and is usually iterative and statistical (ML or MAP), and the solution may not be right.
Luckily, your case is easier, but still not so easy because you have multiple pieces :p
I think that it may be commonly called mixture deconvolution?
So let f[t] for t = 0, 1, ..., N be your long signal. Let h1[t], ..., hn[t] for t = 0, 1, 2, ..., M be your short signals. Obviously here, N >> M.
So your hypothesis is that:
(1) f[t] = h1[t+a1[1]]+h1[t+a1[2]] + ...
+h2[t+a2[1]]+h2[t+a2[2]] + ...
+....
+hn[t+an[1]]+hn[t+an[2]] + ...
Observe that each line of that sum is actually hj * uj, where uj is a sum of shifted Kronecker deltas. The * here is convolution.
So now what?
Let Hj be the (maybe transposed depending on how you look at it) Toeplitz matrix generated by hj, then the equation above becomes:
(2) F = H1 U1 + H2 U2 + ... + Hn Un
where F is the vector [f[0], ..., f[N]] and Uj is the vector [uj[0], ..., uj[N]],
subject to the constraint that each uj[k] must be either 0 or 1.
So you can rewrite this as:
(3) F = H * U
where H = [H1 ... Hn] (horizontal concatenation) and U = [U1; ... ;Un] (vertical concatenation).
H is an Nx(nN) matrix. U is an nN vector.
Ok, so the solution space is finite. It is 2^(nN) in size. So you can try all possible combinations to see which one gives you the lowest ||F - H*U||, but that will take too long.
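Just to make the structure of equation (3) concrete, here is a toy Haskell sketch (all names and the tiny example signal are made up, and only a single piece is used): it builds the shifted copies of the piece as the columns of H and brute-forces the binary vector U, which is only sensible for very small instances, which is exactly the point made above.

import Data.List (minimumBy)
import Data.Ord (comparing)

-- The column of H corresponding to shifting the piece h to offset `shift`,
-- written out to length n.
shiftedPiece :: [Double] -> Int -> Int -> [Double]
shiftedPiece h shift n =
  [ if i >= shift && i - shift < length h then h !! (i - shift) else 0 | i <- [0 .. n - 1] ]

columnsOfH :: [Double] -> Int -> [[Double]]
columnsOfH h n = [ shiftedPiece h s n | s <- [0 .. n - 1] ]

-- H * U for a binary U: sum the columns whose entry in U is 1.
applyH :: [[Double]] -> [Int] -> [Double]
applyH cols u =
  foldr (zipWith (+)) (replicate (length (head cols)) 0) [ c | (c, 1) <- zip cols u ]

residual :: [Double] -> [Double] -> Double
residual f g = sum [ (x - y) ^ 2 | (x, y) <- zip f g ]

-- Exhaustive search over all 2^n binary vectors U: only usable for tiny n.
bestU :: [Double] -> [[Double]] -> [Int]
bestU f cols =
  minimumBy (comparing (residual f . applyH cols))
            (sequence (replicate (length cols) [0, 1]))

-- ghci> let piece = [1, 2, 1]; f = [1, 2, 1, 0, 1, 2, 1, 0]
-- ghci> bestU f (columnsOfH piece 8)
-- [1,0,0,0,1,0,0,0]     -- i.e. the piece occurs at shifts 0 and 4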
What you can do is solve equation (3) using the pseudo-inverse, or multi-linear regression (which uses least squares, which comes down to the pseudo-inverse), or something along the lines of this question:
Is it possible to solve a non-square under/over constrained matrix using Accelerate/LAPACK?
Then move that solution around within the null space of H to get a solution subject to the constraint that uj[k] must be either 0 or 1.
Alternatively, you can use something like Nelder-Mead or Levenberg-Marquardt to find the minimum of:
||F - H U|| + lambda g(U)
where g is a regularization function defined as:
g(U) = ||U - U*||
where U*[j] = 0 if |U[j]|<|U[j]-1|, else 1
Ok, so I have no idea if this will converge. If not, you have to come up with your own regularizer. It's kinda dumb to use a generalized nonlinear optimizer when you have a set of linear equations.
In reality, you're going to have noise and what not, so it actually may not be a bad idea to use something like MAP and apply the small pieces as prior.

Minimize a function

Suppose you are given a function of a single variable and arguments a and b and are asked to find the minimum value that the function takes on the interval [a, b]. (You can assume that the argument is a double, though in my application I may need to use an arbitrary-precision library.)
In general this is a hard problem because functions can be weird. A simple version of this problem would be to minimize the function assuming that it is continuous (no gaps or jumps) and single-peaked (there is a unique minimum; to the left of the minimum the function is decreasing and to the right it is increasing). Is there a good way to solve this easier (but perhaps not easy!) problem?
Assume that the function may be difficult to calculate but not particularly expensive to store an answer that you've computed. (Obviously, it's better if you don't have to make giant arrays of key/value pairs.)
Bonus points for good ideas on improving the algorithm in the fortunate case in which it's nice (e.g.: derivative exists, function is smooth/analytic, derivative can be computed in closed form, derivative can be computed at no cost when the function is evaluated).
The version you describe, with a single minimum, is easy to solve.
The idea is this. Suppose that I have 3 points with a < b < c and f(b) < f(a) and f(b) < f(c). Then the true minimum is between a and c. Furthermore, if I pick another point d somewhere in the interval, I can throw away one of the end points a or c and still have an interval with the true minimum in the middle. My approximation will improve exponentially quickly as I do more iterations.
We don't quite start with this. We start with 2 points, a and b, and know that the answer is somewhere in the middle. Take the mid-point. If f there is below both end points, we're in the case I discussed above. Otherwise it must be below one of the end points and above the other. We can throw away the higher end point and repeat.
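Here is a rough Haskell sketch of that shrinking step (illustrative only; it assumes the caller already has a valid bracket a < b < c with f(b) below both f(a) and f(c), and it omits the bootstrap from two end points described above):

-- Keep a triple (a, b, c) with f(b) < f(a) and f(b) < f(c); probe the
-- midpoint of the wider half and drop whichever outer point is no longer
-- needed.  Stops when the bracket is narrower than tol.
shrinkBracket :: (Double -> Double) -> Double -> (Double, Double, Double) -> Double
shrinkBracket f tol (a, b, c)
  | c - a < tol   = b
  | c - b > b - a = let d = (b + c) / 2                -- probe the right half
                    in if f d < f b
                         then shrinkBracket f tol (b, d, c)
                         else shrinkBracket f tol (a, b, d)
  | otherwise     = let d = (a + b) / 2                -- probe the left half
                    in if f d < f b
                         then shrinkBracket f tol (a, d, b)
                         else shrinkBracket f tol (d, b, c)

-- ghci> shrinkBracket (\x -> (x - 2) ^ 2) 1e-9 (0, 1, 5)
-- 2.0 (up to the tolerance)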
If the function is nice, i.e., strictly unimodal (strictly decreasing to the left of the minimum and strictly increasing to the right), then you can find the minimum with a binary search:
    set x = (a + b) / 2
    test whether x is to the left or the right of the minimum
    if x is left of the minimum:  a = x
    if x is right of the minimum: b = x
    repeat from the start until you get bored
    the minimum is at x
To test whether x is left/right of the minimum, invent a small value epsilon and check whether f(x - epsilon) < f(x + epsilon). If it is, the minimum is to the left, otherwise it's to the right. By "until you get bored", I mean: invent another small value delta and stop if fabs(f(x - epsilon) - f(x + epsilon)) < delta.
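In Haskell, a compact version of this search might look like the following (a sketch only; it uses the width of the bracket rather than the delta test as its stopping rule, and the 3 * eps threshold is my own choice):

-- Assumes f is strictly unimodal on [a, b] and eps > 0.  The bracket is kept
-- slightly wider than half at each step so the minimum is never lost.
epsSearchMin :: (Double -> Double) -> Double -> Double -> Double -> Double
epsSearchMin f a b eps
  | b - a < 3 * eps           = (a + b) / 2                      -- bracket as tight as eps allows
  | f (x - eps) < f (x + eps) = epsSearchMin f a (x + eps) eps   -- minimum is left of x + eps
  | otherwise                 = epsSearchMin f (x - eps) b eps   -- minimum is right of x - eps
  where x = (a + b) / 2

-- ghci> epsSearchMin (\x -> (x - 2) ^ 2) 0 10 1e-6
-- 2.0 (to within a few eps)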
Note that in the general case where you don't know anything about the behavior of a function f, it's not possible to decide a non-trivial property of f. Well, unless you're willing to try all possible inputs. See Rice's Theorem for details.
The Boost project has an implementation of Brent's algorithm that may be useful.
It seems to assume that the function is continuous, and has no maxima (only a minimum) in the input interval.
Not a direct answer but a pointer to more reading:
scipy.optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
section e04 of naglib: http://www.nag.co.uk/numeric/cl/nagdoc_cl09/html/genint/libconts.html
For the special case where the function is twice differentiable (and both derivatives can be calculated easily), one can use Newton's method for optimization, i.e. essentially finding the roots of the first derivative (a necessary condition for a minimum).
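A minimal sketch in Haskell, assuming both derivatives are supplied as functions (note that the iteration converges to a stationary point near the starting guess, which need not be a minimum):

-- Newton's method applied to f': x_{k+1} = x_k - f'(x_k) / f''(x_k).
newtonMin :: (Double -> Double) -> (Double -> Double) -> Double -> Int -> Double
newtonMin f' f'' x0 steps = iterate step x0 !! steps
  where step x = x - f' x / f'' x

-- ghci> newtonMin (\x -> 2 * (x - 3)) (const 2) 0 10    -- minimising (x - 3)^2
-- 3.0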
Concerning the general case, note that the extreme case of 'weird' is a function which is continuous nowhere and for which it is very hard if not impossible to find the minimum (in finite time). So I guess you should try to make at least some assumptions about the function you are trying to minimize.
What you want is to optimize a unimodal function. The correct algorithm is similar to btilly's, but you need extra points.
Take 4 points a < b < c < d.
We want to minimize f in [a,d].
If f(b) < f(c), we know the minimum is in [a, c].
If f(b) > f(c), we know the minimum is in [b, d].
This can give an algorithm by itself, but there is a nice trick involving the golden ratio that allows you to reuse intermediate values (so that you only need to compute f once per iteration instead of twice).
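For reference, here is a Haskell sketch of that golden-ratio trick (golden-section search); one of the two interior evaluations is reused each iteration, so f is called only once per step after the first:

-- Assumes f is unimodal on [a, b].  The invariant is that p and q sit at the
-- golden-ratio points of [lo, hi], with fp = f p and fq = f q.
goldenMin :: (Double -> Double) -> Double -> Double -> Double -> Double
goldenMin f a b tol = go a b x1 x2 (f x1) (f x2)
  where
    invphi = (sqrt 5 - 1) / 2
    x1 = b - (b - a) * invphi
    x2 = a + (b - a) * invphi
    go lo hi p q fp fq
      | hi - lo < tol = (lo + hi) / 2
      | fp < fq       = let p' = q - (q - lo) * invphi   -- new interval [lo, q]; p becomes the new q
                        in go lo q p' p (f p') fp
      | otherwise     = let q' = p + (hi - p) * invphi   -- new interval [p, hi]; q becomes the new p
                        in go p hi q q' fq (f q')

-- ghci> goldenMin (\x -> (x - 2) ^ 2 + 1) 0 10 1e-9
-- 2.0 (up to the tolerance)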
If you have an expression for the function, there are global optimization algorithms based on interval analysis.

Why don't genetic algorithms work on problems like factoring RSA?

Some time ago I was pretty interested in GAs and I studied them quite a bit. I used the C++ library GAlib to write some programs and I was quite amazed by their ability to solve otherwise difficult-to-compute problems in a matter of seconds. They seemed like a great brute-forcing technique that works really smart and adapts.
I was reading a book by Michalewicz, if I remember the name correctly, and it all seemed to be based on the Schema Theorem, due to John Holland.
I've also heard that GAs cannot really be used to approach problems like factoring RSA private keys.
Could anybody explain why this is the case?
Genetic algorithms are not smart at all; they are very greedy optimization algorithms. They all work around the same idea: you have a group of points ('a population of individuals'), and you transform that group into another one with stochastic operators, with a bias in the direction of best improvement ('mutation + crossover + selection'). Repeat until it converges or you are tired of it; nothing smart there.
For a genetic algorithm to work, a new population of points should perform close to the previous population of points. A little perturbation should create a little change. If, after a small perturbation of a point, you obtain a point that represents a solution with completely different performance, then the algorithm is nothing better than random search, which is usually not a good optimization algorithm. In the RSA case, if your points are directly the numbers, it's either YES or NO, just by flipping a bit... Thus a genetic algorithm is no better than random search if you represent the RSA problem without much thought ("let's encode the search points as the bits of the numbers").
I would say it's because factorisation of keys is not an optimisation problem, but an exact problem. This distinction is not very precise, so here are the details.
Genetic algorithms are great for solving problems where there are local/global minima to move towards, but there aren't any in the factorisation problem. Genetic algorithms, like DCA or simulated annealing, need a measure of "how close am I to the solution", and you can't define one for this problem.
For an example of a problem where genetic algorithms work well, consider hill climbing.
GAs are based on fitness evaluation of candidate solutions.
You basically have a fitness function that takes in a candidate solution as input and gives you back a scalar telling you how good that candidate is. You then go on and allow the best individuals of a given generation to mate with higher probability than the rest, so that the offspring will be (hopefully) more 'fit' overall, and so on.
There is no way to evaluate fitness (how good is a candidate solution compared to the rest) in the RSA factorization scenario, so that's why you can't use them.
GAs are not brute-forcing, they’re just a search algorithm. Each GA essentially looks like this:
candidates = seed_value;
while (!good_enough(best_of(candidates))) {
    candidates = compute_next_generation(candidates);
}
Where good_enough and best_of are defined in terms of a fitness function. A fitness function says how well a given candidate solves the problem. That seems to be the core issue here: how would you write a fitness function for factorization? For example 20 = 2*10 or 4*5. The tuples (2,10) and (4,5) are clearly winners, but what about the others? How “fit” is (1,9) or (3,4)?
Indirectly, you can use a genetic algorithm to factor an integer N. Dixon's integer factorization method uses equations involving powers of the first k primes, modulo N. These products of powers of small primes are called "smooth". If we are using the first k=4 primes - {2,3,5,7} - 42=2x3x7 is smooth and 11 is not (for lack of a better term, 11 is "rough"). Dixon's method requires an invertible k x k matrix consisting of the exponents that define these smooth numbers. For more on Dixon's method see https://en.wikipedia.org/wiki/Dixon%27s_factorization_method.
Now, back to the original question: There is a genetic algorithm for finding equations for Dixon's method.
Let r be the inverse of a smooth number mod N - so r is a rough number
Let s be smooth
Generate random solutions of rx = sy mod N. These solutions [x,y] are the population for the genetic algorithm. Each x, y has a smooth component and a rough component. For example suppose x = 369 = 9 x 41. Then (assuming 41 is not small enough to count as smooth), the rough part of x is 41 and the smooth part is 9.
Choose pairs of solutions - "parents" - to combine into linear combinations with ever smaller rough parts.
The algorithm terminates when a pair [x,y] is found with rough parts [1,1], [1,-1],[-1,1] or [-1,-1]. This yields an equation for Dixon's method, because rx=sy mod N and r is the only rough number left: x and y are smooth, and s started off smooth. But even 1/r mod N is smooth, so it's all smooth!
Every time you combine two pairs - say [v,w] and [x,y] - the smooth parts of the four numbers are obliterated, except for the factors the smooth parts of v and x share, and the factors the smooth parts of w and y share. So we choose parents that share smooth parts to the greatest possible extent. To make this precise, write
g = gcd(smooth part of v, smooth part of x)
h = gcd(smooth part of w, smooth part of y)
[v,w], [x,y] = [g v/g, h w/h], [g x/g, h y/h].
The hard-won smooth factors g and h will be preserved into the next generation, but the smooth parts of v/g, w/h, x/g and y/h will be sacrificed in order to combine [v,w] and [x,y]. So we choose parents for which v/g, w/h, x/g and y/h have the smallest smooth parts. In this way we really do drive down the rough parts of our solutions to rx = sy mod N from one generation to the next.
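As a small illustration, here is a helper for the smooth/rough split used above (the function name is mine, the prime set {2, 3, 5, 7} follows the earlier example, and n is assumed to be nonzero):

-- Split n into its smooth part (product of factors drawn from the given
-- small primes) and the leftover rough part.
smoothRough :: [Integer] -> Integer -> (Integer, Integer)
smoothRough primes n = go primes (abs n)
  where
    go [] m = (abs n `div` m, m)          -- (smooth part, rough part)
    go (p : ps) m
      | m `mod` p == 0 = go (p : ps) (m `div` p)
      | otherwise      = go ps m

-- ghci> smoothRough [2, 3, 5, 7] 369
-- (9,41)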
On further thought the best way to make your way towards smooth coefficients x, y in the lattice ax = by mod N is with regression, not a genetic algorithm.
Two regressions are performed, one with response vector R0 consisting of x-values from randomly chosen solutions of ax = by mod N; and the other with response vector R1 consisting of y-values from the same solutions. Both regressions use the same explanatory matrix X. In X are columns consisting of the remainders of the x-values modulo smooth divisors, and other columns consisting of the remainders of the y-values modulo other smooth divisors.
The best choice of smooth divisors is the one that minimizes the errors from each regression:
E0 = R0 - X (inverse of (X-transpose)(X)) (X-transpose) (R0)
E1 = R1 - X (inverse of (X-transpose)(X)) (X-transpose) (R1)
What follows are row operations to annihilate X. Then apply a result z of these row operations to the x- and y-values from the original solutions from which X was formed.
z R0 = z R0 - 0
= z R0 - zX (inverse of (X-transpose)(X)) (X-transpose) (R0)
= z E0
Similarly, z R1 = z E1
Three properties are now combined in z R0 and z R1:
They are multiples of large smooth numbers, because z annihilates remainders modulo smooth numbers.
They are relatively small, since E0 and E1 are small.
Like any linear combination of solutions to ax = by mod N, z R0 and z R1 are themselves solutions to that equation.
A relatively small multiple of a large smooth number might just be the smooth number itself. Having a smooth solution of ax = by mod N yields an input to Dixon's method.
Two optimizations make this particularly fast:
There is no need to guess all the smooth numbers and columns of X at once. You can run regressions continuously, adding one column to X at a time, choosing columns that reduce E0 and E1 the most. At no time will any two smooth numbers with a common factor be selected.
You can also start with a lot of random solutions of ax = by mod N, and remove the ones with the largest errors between selections of new columns for X.

Constraint Satisfaction: Choosing real numbers with certain characteristics

I have a set of n real numbers. I also have a set of functions,
f_1, f_2, ..., f_m.
Each of these functions takes a list of numbers as its argument. I also have a set of m ranges,
[l_1, u_1], [l_2, u_2], ..., [l_m, u_m].
I want to repeatedly choose a subset {r_1, r_2, ..., r_k} of k elements such that
l_i <= f_i({r_1, r_2, ..., r_k}) <= u_i for 1 <= i <= m.
Note that the functions are smooth. Changing one element in {r_1, r_2, ..., r_k} will not change f_i({r_1, r_2, ..., r_k}) by much. Average and variance are two commonly used examples of f_i.
These are the m constraints that I need to satisfy.
Moreover I want to do this so that the set of subsets I choose is uniformly distributed over the set of all subsets of size k that satisfy these m constraints. Not only that, but I want to do this in an efficient manner. How quickly it runs will depend on the density of solutions within the space of all possible solutions (if this is 0.0, then the algorithm can run forever). (Assume that f_i (for any i) can be computed in a constant amount of time.)
Note that n is large enough that I cannot brute-force the problem. That is, I cannot just iterate through all k-element subsets and find which ones satisfy the m constraints.
Is there a way to do this?
What sorts of techniques are commonly used for a CSP like this? Can someone point me in the direction of good books or articles that talk about problems like this (not just CSPs in general, but CSPs involving continuous, as opposed to discrete values)?
Assuming you're looking to write your own application and use existing libraries to do this, there are choices in many languages, like Python-constraint, or Cream or Choco for Java, or CSP for C++. The way you've described the problem, it sounds like you're looking for a general-purpose CSP solver. Are there any properties of your functions that may help reduce the complexity, such as being monotonic?
Given the problem as you've described it, you can pick from each range r_i uniformly and throw away any m-dimensional point that fails to meet the criterion. It will be uniformly distributed because the original is uniformly distributed and the set of subsets is a binary mask over the original.
Without knowing more about the shape of f, you can't make any guarantees about whether time is polynomial or not (or even have any idea of how to hit a spot that meets the constraint). After all, if f_1 = (x^2 + y^2 - 1) and f_2 = (1 - x^2 - y^2) and the constraints are f_1 < 0 and f_2 < 0, you can't satisfy this at all (and without access to the analytic form of the functions, you could never know for sure).
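In that rejection-sampling spirit, applied directly to k-element subsets of the pool, a Haskell sketch might look like this (the (l_i, f_i, u_i) triple representation of the constraints is my own choice; as the question notes, it can loop forever if the density of solutions is zero):

import System.Random (randomRIO)

-- Draw a uniform k-element subset of xs (sampling without replacement);
-- assumes k <= length xs.
pickK :: Int -> [a] -> IO [a]
pickK 0 _  = return []
pickK k xs = do
  i <- randomRIO (0, length xs - 1)
  let (before, x : after) = splitAt i xs
  rest <- pickK (k - 1) (before ++ after)
  return (x : rest)

-- Keep sampling until every (l_i, f_i, u_i) constraint is satisfied.
sampleSubset :: Int -> [Double] -> [(Double, [Double] -> Double, Double)] -> IO [Double]
sampleSubset k pool constraints = do
  subset <- pickK k pool
  let ok (l, f, u) = l <= f subset && f subset <= u
  if all ok constraints then return subset else sampleSubset k pool constraints

-- ghci> sampleSubset 3 [1 .. 20] [(5, \s -> sum s / 3, 8)]   -- mean of the 3 picks in [5, 8]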
Given the information in your message, I'm not sure it can be done at all...
Consider:
numbers = {1....100}
m = 1 (keep it simple)
F1 = Average
L1 = 10
U1 = 50
Now, how many subsets of {1...100} can you come up with that produce an average between 10 and 50?
This looks like a very hard problem. For the simplest case with linear functions you could take a look at linear programming.

Which algorithm will be required to do this?

I have data of this form:
for x=1, y is one of {1,4,6,7,9,18,16,19}
for x=2, y is one of {1,5,7,4}
for x=3, y is one of {2,6,4,8,2}
....
for x=100, y is one of {2,7,89,4,5}
Only one of the values in each set is the correct value, the rest is random noise.
I know that the correct values describe a sinusoid function whose parameters are unknown. How can I find the correct combination of values, one from each set?
I am looking for something like a "travelling salesman"-style combinatorial optimization algorithm.
You're trying to do curve fitting, for which there are several algorithms depending on the type of curve you want to fit your data to (linear, polynomial, etc.). I have no idea whether there is a specific algorithm for sinusoidal curves (Fourier approximations), but my first idea would be to use a polynomial fitting algorithm with a polynomial approximation of the sine.
I wonder whether you need to do this in the course of another larger program, or whether you are trying to do this task on its own. If so, then you'd be much better off using a statistical package, my preferred one being R. It allows you to import your data and fit curves and draw graphs in just a few lines, and you could also use R in batch-mode to call it from a script or even a program (this is what I tend to do).
It depends on what you mean by "exactly", and what you know beforehand. If you know the frequency w, and that the sinusoid is unbiased, you have an equation
a cos(w * x) + b sin(w * x)
with two (x,y) points at different x values you can find a and b, and then check the generated curve against all the other points. Choose the two x values with the smallest number of y observations and try it for all the y's. If there is a bias, i.e. your equation is
a cos(w * x) + b sin(w * x) + c
You need to look at three x values.
If you do not know the frequency, you can try the same technique, unfortunately the solutions may not be unique, there may be more than one w that fits.
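For the unbiased, known-frequency case described above, a Haskell sketch might look like this (the tolerance, the data layout and all names are assumptions): solve the 2x2 system for a and b from two observations, then count how many of the noisy sets contain a y value close to the fitted curve.

-- Given the frequency w and two observations (x1, y1), (x2, y2), solve
--   a cos(w x) + b sin(w x) = y
-- for a and b by Cramer's rule.  Returns Nothing when the two x values
-- do not determine a and b.
fitAB :: Double -> (Double, Double) -> (Double, Double) -> Maybe (Double, Double)
fitAB w (x1, y1) (x2, y2)
  | abs det < 1e-12 = Nothing
  | otherwise       = Just ((y1 * s2 - y2 * s1) / det, (c1 * y2 - c2 * y1) / det)
  where
    (c1, s1) = (cos (w * x1), sin (w * x1))
    (c2, s2) = (cos (w * x2), sin (w * x2))
    det = c1 * s2 - c2 * s1

-- How many of the noisy sets contain some y within tol of the fitted curve.
support :: Double -> Double -> (Double, Double) -> [(Double, [Double])] -> Int
support w tol (a, b) sets =
  length [ () | (x, ys) <- sets
              , any (\y -> abs (y - (a * cos (w * x) + b * sin (w * x))) <= tol) ys ]

-- ghci> fitAB 1.0 (0, 2) (pi / 2, 3)     -- recovers y = 2 cos x + 3 sin x
-- Just (2.0, 3.0)   (up to floating-point rounding)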
Edit: As I understand your problem, you have one real y value for each x and a bunch of incorrect ones. You want to find the real values. The best way to do this is to fit curves through a small number of points and check to see if the curve fits some y value in the other sets.
If not all the x values have valid y values then the same technique applies, but you need to look at a much larger set of pairs, triples or quadruples (essentially every pair, triple, or quad of points with different y values)
If your problem is something else, and I suspect it is, please specify it.
1. Define "sinusoid". Most people take that to mean a function of the form a cos(w * x) + b sin(w * x) + c. If you mean something different, specify it.
2. Specify exactly what success looks like. An example with, say, 10 points instead of 100 would be nice.
It is extremely unclear what this has to do with combinatorial optimization.
Sinusoidal functions are so general that almost any random choice of one y from each set can be fitted by some sinusoid. Unless you impose conditions (e.g. frequency < 100, or all parameters are integers), it is not possible to distinguish noise from data theoretically, so work on finding such conditions from your data source/experiment first.
By sinusoidal, do you mean a function that is increasing for n steps, then decreasing for n steps, and so on? If so, you can model your data as a sequence of nodes connected by up-links and down-links. For each node (possible value of y), record the length and end value of chains of only ascending or descending links (there will be multiple chains per node). Then scan for consecutive runs of equal length and opposite direction, modulo some initial offset.
