Constraint Satisfaction: Choosing real numbers with certain characteristics - algorithm

I have a set of n real numbers. I also have a set of functions,
f_1, f_2, ..., f_m.
Each of these functions takes a list of numbers as its argument. I also have a set of m ranges,
[l_1, u_1], [l_2, u_2], ..., [l_m, u_m].
I want to repeatedly choose a subset {r_1, r_2, ..., r_k} of k elements such that
l_i <= f_i({r_1, r_2, ..., r_k}) <= u_i for 1 <= i <= m.
Note that the functions are smooth: changing one element of {r_1, r_2, ..., r_k} will not change f_i({r_1, r_2, ..., r_k}) by much. Average and variance are two commonly used examples of f_i.
These are the m constraints that I need to satisfy.
Moreover I want to do this so that the set of subsets I choose is uniformly distributed over the set of all subsets of size k that satisfy these m constraints. Not only that, but I want to do this in an efficient manner. How quickly it runs will depend on the density of solutions within the space of all possible solutions (if this is 0.0, then the algorithm can run forever). (Assume that f_i (for any i) can be computed in a constant amount of time.)
Note that n is large enough that I cannot brute-force the problem. That is, I cannot just iterate through all k-element subsets and find which ones satisfy the m constraints.
Is there a way to do this?
What sorts of techniques are commonly used for a CSP like this? Can someone point me in the direction of good books or articles that talk about problems like this (not just CSPs in general, but CSPs involving continuous, as opposed to discrete values)?

Assuming you're looking to write your own application and use existing libraries to do this, there are choices in many languages, like Python-constraint, or Cream or Choco for Java, or CSP for C++. The way you've described the problem, it sounds like you're looking for a general-purpose CSP solver. Are there any properties of your functions that may help reduce the complexity, such as being monotonic?

Given the problem as you've described it, you can sample candidate k-element subsets uniformly at random and throw away any candidate that fails to meet the criteria. The accepted subsets will be uniformly distributed, because the original samples are uniformly distributed and the feasible set is just a mask over them.
Without knowing more about the shape of f, you can't make any guarantees about whether time is polynomial or not (or even have any idea of how to hit a spot that meets the constraint). After all, if f_1 = (x^2 + y^2 - 1) and f_2 = (1 - x^2 - y^2) and the constraints are f_1 < 0 and f_2 < 0, you can't satisfy this at all (and without access to the analytic form of the functions, you could never know for sure).
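For concreteness, here is a minimal rejection-sampling sketch in Python; the particular constraint functions (mean and variance with made-up bounds) are placeholders, not part of the original problem statement:

import random
import statistics

def sample_subset(numbers, k, constraints, max_tries=100000):
    # Rejection sampling: propose k-element subsets uniformly at random and keep
    # the first one satisfying every (f_i, l_i, u_i) constraint. Accepted subsets
    # are uniform over the feasible subsets because every candidate is proposed
    # with equal probability and rejection only masks out the infeasible ones.
    for _ in range(max_tries):
        candidate = random.sample(numbers, k)
        if all(l <= f(candidate) <= u for f, l, u in constraints):
            return candidate
    return None  # the density of solutions may be too low

# Hypothetical constraints: average in [4, 6] and variance in [0, 10].
constraints = [(statistics.mean, 4.0, 6.0), (statistics.pvariance, 0.0, 10.0)]
numbers = [random.uniform(0, 10) for _ in range(1000)]
print(sample_subset(numbers, k=20, constraints=constraints))

The expected number of proposals is the reciprocal of the fraction of feasible subsets, which is exactly the density caveat mentioned above.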

Given the information in your message, I'm not sure it can be done at all...
Consider:
numbers = {1...100}
m = 1 (keep it simple)
F1 = Average
L1 = 10
U1 = 50
Now, how many subsets of {1...100} can you come up with that produce an average between 10 and 50?

This looks like a very hard problem. For the simplest case with linear functions you could take a look at linear programming.

Related

Querying large amounts of multidimensional points in R^N

I'm looking at listing/counting the number of integer points in R^N (in the sense of Euclidean space), within certain geometric shapes, such as circles and ellipses, subject to various conditions, for small N. By this I mean that N < 5, and the conditions are polynomial inequalities.
As a concrete example, take R^2. One of the queries I might like to run is "How many integer points are there in an ellipse (parameterised by x = 4 cos(theta), y = 3 sin(theta) ), such that y * x^2 - x * y = 4?"
I could implement this in Haskell like this:
ghci> let latticePoints = [(x,y) | x <- [-4..4], y <-[-3..3], 9*x^2 + 16*y^2 <= 144, y*x^2 - x*y == 4]
and then I would have:
ghci> latticePoints
[(-1,2),(2,2)]
Which indeed answers my question.
Of course, this is a very naive implementation, but it demonstrates what I'm trying to achieve. (I'm also only using Haskell here as I feel it most directly expresses the underlying mathematical ideas.)
Now, if I had something like "In R^5, how many integer points are there in a 4-sphere of radius 1,000,000, satisfying x^3 - y + z = 20?", I might try something like this:
ghci> :{
Prelude| let latticePoints2 = [(x,y,z,w,v) | x <-[-1000..1000], y <- [-1000..1000],
Prelude| z <- [-1000..1000], w <- [-1000..1000], v <- [-1000..1000],
Prelude| x^2 + y^2 + z^2 + w^2 + v^2 <= 1000000, x^3 - y + z == 20]
Prelude| :}
so if I now type:
ghci> latticePoints2
Not much will happen...
I imagine the issue is that it's effectively looping through 2000^5 (32 quadrillion!) points, and it's clearly unreasonable of me to expect my computer to deal with that. I can't imagine doing a similar implementation in Python or C would help matters much either.
So if I want to tackle a large number of points in such a way, what would be my best bet in terms of general algorithms or data structures? I saw in another thread (Count number of points inside a circle fast), someone mention quadtrees as well as K-D trees, but I wouldn't know how to implement those, nor how to appropriately query one once it was implemented.
I'm aware some of these numbers are quite large, but the biggest circles, ellipses, etc I'd be dealing with are of radius 10^12 (one trillion), and I certainly wouldn't need to deal with R^N with N > 5. If the above is NOT possible, I'd be interested to know what sort of numbers WOULD be feasible?
There is no general way to solve this problem. The problem of finding integer solutions to algebraic equations (equations of this sort are called Diophantine equations) is known to be undecidable. Apparently, you can write equations of this sort such that solving the equations ends up being equivalent to deciding whether a given Turing machine will halt on a given input.
In the examples you've listed, you've always constrained the points to be on some well-behaved shape, like an ellipse or a sphere. While this particular class of problem is definitely decidable, I'm skeptical that you can efficiently solve these problems for more complex curves. I suspect that it would be possible to construct short formulas that describe curves that are mostly empty but have a huge bounding box.
If you happen to know more about the structure of the problems you're trying to solve - for example, if you're always dealing with spheres or ellipses - then you may be able to find fast algorithms for this problem. In general, though, I don't think you'll be able to do much better than brute force. I'm willing to admit that (and in fact, hopeful that) someone will prove me wrong about this, though.
The idea behind the kd-tree method is that you recursively subdivide the search box and try to rule out whole boxes at a time. Given the current box, use some method that either (a) declares that all points in the box match the predicate, (b) declares that no points in the box match the predicate, or (c) makes no declaration (one possibility, which may be particularly convenient in Haskell: interval arithmetic). On (c), cut the box in half (say along the longest dimension) and recursively count in the halves. Obviously the method can choose (c) all the time, which devolves to brute force; the goal here is to do (a) or (b) as much as possible.
The performance of this method is very dependent on how it's instantiated. Try it -- it shouldn't be more than a couple dozen lines of code.
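For instance, here is a rough Python sketch of that recursion, instantiated for counting integer points in a disk; the simple interval bounds on each squared coordinate are my own illustrative choice, not a general-purpose evaluator:

def count(box, definitely_in, definitely_out):
    # box is a list of per-dimension (lo, hi) integer ranges
    n = 1
    for lo, hi in box:
        if lo > hi:
            return 0
        n *= hi - lo + 1
    if definitely_out(box):          # case (b): no point in the box can match
        return 0
    if definitely_in(box):           # case (a): every point in the box matches
        return n
    if n == 1:
        return 0                     # undecided single point; cannot happen if the bounds are exact at points
    # case (c): split the longest dimension in half and recurse on both halves
    d = max(range(len(box)), key=lambda i: box[i][1] - box[i][0])
    lo, hi = box[d]
    mid = (lo + hi) // 2
    return (count(box[:d] + [(lo, mid)] + box[d + 1:], definitely_in, definitely_out)
            + count(box[:d] + [(mid + 1, hi)] + box[d + 1:], definitely_in, definitely_out))

# Example predicate: x^2 + y^2 <= R^2, with interval bounds on each square.
R2 = 1000 ** 2

def square_range(lo, hi):
    lo2, hi2 = lo * lo, hi * hi
    return (0 if lo <= 0 <= hi else min(lo2, hi2)), max(lo2, hi2)

def definitely_in(box):
    return sum(square_range(lo, hi)[1] for lo, hi in box) <= R2

def definitely_out(box):
    return sum(square_range(lo, hi)[0] for lo, hi in box) > R2

print(count([(-1000, 1000), (-1000, 1000)], definitely_in, definitely_out))  # roughly pi * 1000^2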
For a nicely connected region, assuming your shape is significantly smaller than the containing search space, and given a seed point, you could do a growth/building algorithm:
Given a seed point:
    Push seed point into test-queue
    while test-queue has items:
        Pop item from test-queue
        If item tests to be within region (eg using a callback function):
            Add item to inside-set
            for each neighbour point (generated on the fly):
                if neighbour not in outside-set and neighbour not in inside-set:
                    Add neighbour to test-queue
        else:
            Add item to outside-set
    return inside-set
The trick is to find an initial seed point that is inside the region.
Make sure your set implementation gives O(1) duplicate checking. This method will eventually break down with large numbers of dimensions as the surface area exceeds the volume, but for 5 dimensions it should be fine.
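A compact Python version of the growth algorithm above; the disk predicate, the 4-connected neighbours and the seed are placeholder choices for illustration:

from collections import deque

def grow_region(seed, in_region, neighbours):
    # Flood-fill outward from the seed, collecting every point for which
    # in_region(point) is True. The seen set gives the O(1) duplicate check
    # (it plays the role of "inside-set or outside-set or already queued").
    inside, outside = set(), set()
    queue = deque([seed])
    seen = {seed}
    while queue:
        p = queue.popleft()
        if in_region(p):
            inside.add(p)
            for q in neighbours(p):
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
        else:
            outside.add(p)
    return inside

# Example: integer points inside a disk of radius 50, seeded at the centre.
in_disk = lambda p: p[0] ** 2 + p[1] ** 2 <= 50 ** 2
neigh = lambda p: [(p[0] + dx, p[1] + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
print(len(grow_region((0, 0), in_disk, neigh)))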

Most efficient algorithm to compute a common numerator of a sum of fractions

I'm pretty sure that this is the right site for this question, but feel free to move it to some other stackexchange site if it fits there better.
Suppose you have a sum of fractions a1/d1 + a2/d2 + … + an/dn. You want to compute a common numerator and denominator, i.e., rewrite it as p/q. We have the formula
p = a1*d2*…*dn + d1*a2*d3*…*dn + … + d1*d2*…d(n-1)*an
q = d1*d2*…*dn.
What is the most efficient way to compute these things, in particular, p? You can see that if you compute it naïvely, i.e., using the formula I gave above, you compute a lot of redundant things. For example, you will compute d1*d2 n-1 times.
My first thought was to iteratively compute d1*d2, d1*d2*d3, … and dn*d(n-1), dn*d(n-1)*d(n-2), … but even this is inefficient, because you will end up computing multiplications in the "middle" twice (e.g., if n is large enough, you will compute d3*d4 twice).
I'm sure this problem could be expressed somehow using maybe some graph theory or combinatorics, but I haven't studied enough of that stuff to have a good feel for it.
And one note: I don't care about cancelation, just the most efficient way to multiply things.
UPDATE:
I should have known that people on stackoverflow would be assuming that these were numbers, but I've been so used to my use case that I forgot to mention this.
We cannot just "divide" out d_i from each term. The use case here is a symbolic system. Actually, I am trying to fix a function called .as_numer_denom() in the SymPy computer algebra system, which presently computes this the naïve way. See the corresponding SymPy issue.
Dividing things out has some problems, which I would like to avoid. First, there is no guarantee that things will cancel. This is because, mathematically, (a*b)**n != a**n*b**n in general (if a and b are positive it holds, but e.g., if a == b == -1 and n == 1/2, you get (a*b)**n == 1**(1/2) == 1 but (-1)**(1/2)*(-1)**(1/2) == I*I == -1). So I don't think it's a good idea to assume that dividing by d_i will cancel it in the expression (this may actually be unfounded, I'd need to check what the code does).
Second, I'd also like to apply this algorithm to computing the sum of rational functions. In this case, the terms would automatically be multiplied together into a single polynomial, and "dividing" out each d_i would involve applying the polynomial division algorithm. You can see that in this case, you really do want to compute the most efficient multiplication in the first place.
UPDATE 2:
I think my fears for cancelation of symbolic terms may be unfounded. SymPy does not cancel things like x**n*x**(m - n) automatically, but I think that any exponents that would combine through multiplication would also combine through division, so powers should be canceling.
There is an issue with constants automatically distributing across additions, like:
In [13]: 2*(x + y)*z*(S(1)/2)
Out[13]:
z⋅(2⋅x + 2⋅y)
─────────────
2
But this is first a bug and second could never be a problem (I think) because 1/2 would be split into 1 and 2 by the algorithm that gets the numerator and denominator of each term.
Nonetheless, I still want to know how to do this without "dividing out" di from each term, so that I can have an efficient algorithm for summing rational functions.
Instead of adding up n quotients in one go I would use pairwise addition of quotients.
If things cancel out in partial sums then the numbers or polynomials stay smaller, which makes computation faster.
You avoid the problem of computing the same product multiple times.
You could try to order the additions in a certain way, to make canceling more likely (maybe add quotients with small denominators first?), but I don't know if this would be worthwhile.
If you start from scratch this is simpler to implement, though I'm not sure it fits as a replacement of the problematic routine in SymPy.
Edit: To make it more explicit, I propose to compute a1/d1 + a2/d2 + … + an/dn as (…(a1/d1 + a2/d2) + … ) + an/dn.
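Here is a minimal sketch of that left-to-right folding, using Python's fractions module as a stand-in for SymPy's rationals; each partial sum is reduced to lowest terms, which is where the cancellation happens:

from fractions import Fraction

def fold_sum(numerators, denominators):
    # Compute a1/d1 + ... + an/dn as ((a1/d1 + a2/d2) + ...) + an/dn,
    # reducing after every partial sum so the intermediate values stay small.
    total = Fraction(0)
    for a, d in zip(numerators, denominators):
        total += Fraction(a, d)  # Fraction reduces to lowest terms automatically
    return total.numerator, total.denominator

print(fold_sum([1, 1, 1, 1], [2, 3, 4, 6]))  # (5, 4), since 1/2 + 1/3 + 1/4 + 1/6 = 5/4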
Compute two new arrays:
The first contains the partial products from the left: l[0] = 1, l[i] = l[i-1] * d[i] for i = 1..n.
The second contains the partial products from the right: r[n+1] = 1, r[i] = d[i] * r[i+1] for i = n down to 1.
In both cases, 1 is the multiplicative identity of whatever ring you are working in.
Then each of the terms on top is t[i] = l[i-1] * a[i] * r[i+1].
This assumes multiplication is associative, but it need not be commutative.
As a first optimization, you don't actually have to create r as an array: you can do a first pass to calculate all the l values, and accumulate the r values during a second (backward) pass to calculate the summands. No need to actually store the r values since you use each one once, in order.
In your question you say that this computes d3*d4 twice, but it doesn't. It does multiply two different values by d4 (one a right-multiplication and the other a left-multiplication), but that's not exactly a repeated operation. Anyway, the total number of multiplications is about 4*n, vs. 2*n multiplications and n divisions for the other approach that doesn't work in non-commutative multiplication or non-field rings.
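Here is a short Python sketch of that two-pass scheme, 0-indexed and with the r values folded into the backward pass as described, so no r array is stored:

def numerator_terms(a, d):
    # t[i] = d[0]*...*d[i-1] * a[i] * d[i+1]*...*d[n-1], using multiplication only.
    # The order of the factors is preserved, so this also works in a
    # non-commutative ring with identity 1.
    n = len(d)
    l = [1] * n                      # forward pass: l[i] = d[0] * ... * d[i-1]
    for i in range(1, n):
        l[i] = l[i - 1] * d[i - 1]
    t = [None] * n
    r = 1                            # backward pass: accumulate the right product on the fly
    for i in range(n - 1, -1, -1):
        t[i] = l[i] * a[i] * r
        r = d[i] * r
    return t

t = numerator_terms([1, 1, 1], [2, 3, 4])
print(t, sum(t))  # [12, 8, 6] 26, i.e. 1/2 + 1/3 + 1/4 = 26/24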
If you want to compute p in the above expression, one way to do this would be to multiply together all of the denominators (in O(n), where n is the number of fractions), letting this value be D. Then, iterate across all of the fractions and for each fraction with numerator ai and denominator di, compute ai * D / di. This last term is equal to the product of the numerator of the fraction and all of the denominators other than its own. Each of these terms can be computed in O(1) time (assuming you're using hardware multiplication, otherwise it might take longer), and you can sum them all up in O(n) time.
This gives an O(n)-time algorithm for computing the numerator and denominator of the new fraction.
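A sketch of that approach in exact integer arithmetic; note that it relies on D being exactly divisible by each di, so it suits integers or fields but not the symbolic setting discussed above:

from math import prod

def numer_denom(a, d):
    D = prod(d)                                       # common denominator, O(n) multiplications
    p = sum(ai * (D // di) for ai, di in zip(a, d))   # each term in O(1) multiplications
    return p, D

print(numer_denom([1, 1, 1], [2, 3, 4]))  # (26, 24), i.e. 13/12 before cancellation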
It was also pointed out to me that you could manually sift out common denominators and combine those trivially without multiplication.

Minimize a function

Suppose you are given a function of a single variable and arguments a and b and are asked to find the minimum value that the function takes on the interval [a, b]. (You can assume that the argument is a double, though in my application I may need to use an arbitrary-precision library.)
In general this is a hard problem because functions can be weird. A simple version of this problem would be to minimize the function assuming that it is continuous (no gaps or jumps) and single-peaked (there is a unique minimum; to the left of the minimum the function is decreasing and to the right it is increasing). Is there a good way to solve this easier (but perhaps not easy!) problem?
Assume that the function may be difficult to calculate but not particularly expensive to store an answer that you've computed. (Obviously, it's better if you don't have to make giant arrays of key/value pairs.)
Bonus points for good ideas on improving the algorithm in the fortunate case in which it's nice (e.g.: derivative exists, function is smooth/analytic, derivative can be computed in closed form, derivative can be computed at no cost when the function is evaluated).
The version you describe, with a single minimum, is easy to solve.
The idea is this. Suppose that I have 3 points with a < b < c and f(b) < f(a) and f(b) < f(c). Then the true minimum is between a and c. Furthermore, if I pick another point d somewhere in the interval and compare f(d) with f(b), I can throw away one of the end points a or c and still have a bracket with the true minimum inside it. My approximations will improve exponentially quickly as I do more iterations.
We don't quite start with this. We start with 2 points, a and b, and know that the answer is somewhere in the middle. Take the mid-point. If f there is below the end points, we're into the case I discussed above. Otherwise it must be below one of the end points, and above the other. We can throw away the higher end point and repeat.
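A small Python sketch of that bracketing scheme; the stopping tolerance and the choice to probe the midpoint of the larger half are my own arbitrary choices:

def minimize_unimodal(f, a, c, tol=1e-8):
    # Maintain a bracket a < b < c with f(b) <= f(a) and f(b) <= f(c).
    b = (a + c) / 2
    fa, fb, fc = f(a), f(b), f(c)
    # Establish the bracket: while the midpoint is not below both ends,
    # throw away the higher end point and bisect again.
    while fb > min(fa, fc) and c - a > tol:
        if fa < fc:
            c, fc = b, fb
        else:
            a, fa = b, fb
        b = (a + c) / 2
        fb = f(b)
    # Shrink the bracket: probe the midpoint d of the larger half and
    # discard one end point per step, keeping the minimum inside.
    while c - a > tol:
        d = (a + b) / 2 if b - a > c - b else (b + c) / 2
        fd = f(d)
        if fd < fb:
            a, b, c, fb = (a, d, b, fd) if d < b else (b, d, c, fd)
        else:
            a, b, c = (d, b, c) if d < b else (a, b, d)
    return b

print(minimize_unimodal(lambda x: (x - 1.7) ** 2 + 3, 0.0, 10.0))  # ~1.7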
If the function is nice, i.e., single-peaked and strictly monotonic (i.e., strictly decreasing to the left of the minimum and strictly increasing to the right), then you can find the minimum with binary search:
Set x = (a + b) / 2
test whether x is to the left of the minimum or to the right
if x is left of the minimum: a = x
if x is right of the minimum: b = x
repeat from the start until you get bored
the minimum is at x
To test whether x is left/right of the minimum, invent a small value epsilon and check whether f(x - epsilon) < f(x + epsilon). If it is, the minimum is to the left, otherwise it's to the right. By "until you get bored", I mean: invent another small value delta and stop if fabs(f(x - epsilon) - f(x + epsilon)) < delta.
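That procedure in Python, roughly; epsilon and delta are the arbitrary small values mentioned above, and I also stop once the interval itself is narrower than 2*epsilon:

def minimize_by_slope_sign(f, a, b, eps=1e-7, delta=1e-10):
    # Bisect on the sign of the local slope: if f(x - eps) < f(x + eps),
    # the minimum is to the left of x, otherwise it is to the right.
    while True:
        x = (a + b) / 2
        left, right = f(x - eps), f(x + eps)
        if abs(left - right) < delta or b - a < 2 * eps:
            return x
        if left < right:
            b = x       # minimum is to the left of x
        else:
            a = x       # minimum is to the right of x

print(minimize_by_slope_sign(lambda x: (x - 1.7) ** 2 + 3, 0.0, 10.0))  # ~1.7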
Note that in the general case where you don't know anything about the behavior of a function f, it's not possible to decide a non-trivial property of f. Well, unless you're willing to try all possible inputs. See Rice's Theorem for details.
The Boost project has an implementation of Brent's algorithm that may be useful.
It seems to assume that the function is continuous, and has no maxima (only a minimum) in the input interval.
Not a direct answer but a pointer to more reading:
scipy.optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
section e04 of naglib: http://www.nag.co.uk/numeric/cl/nagdoc_cl09/html/genint/libconts.html
For the special case where the function is differentiable twice (and the two derivatives can be calculated easily), one can use Newton's method for optimization, i.e. essentially finding the roots of the first derivative (which is a necessary condition for the minimum).
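For instance, a bare-bones Newton iteration on the first derivative; the example function and its derivatives are mine, and a real implementation would add safeguards to stay inside [a, b] and to handle a non-positive second derivative:

def newton_minimize(f1, f2, x0, tol=1e-12, max_iter=100):
    # Newton's method for optimization: find a root of the first derivative f1
    # using the second derivative f2. Converges quadratically near a minimum
    # where f2 > 0, but needs a reasonable starting point.
    x = x0
    for _ in range(max_iter):
        step = f1(x) / f2(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = x**4 - 3*x + 1, with f'(x) = 4*x**3 - 3 and f''(x) = 12*x**2.
print(newton_minimize(lambda x: 4 * x ** 3 - 3, lambda x: 12 * x ** 2, x0=1.0))  # ~0.9086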
Concerning the general case, note that the extreme case of 'weird' is a function which is continuous nowhere and for which it is very hard if not impossible to find the minimum (in finite time). So I guess you should try to make at least some assumptions about the function you are trying to minimize.
What you want is to optimize a unimodal function. The correct algorithm is similar to btilly's but you need extra points.
Take 4 points a < b < c < d.
We want to minimize f in [a, d].
If f(b) < f(c) we know the minimum is in [a, c].
If f(b) > f(c) we know the minimum is in [b, d].
This can give an algorithm by itself, but there is a nice trick involving the golden ratio that allows you to reuse the intermediate values (so that you only need to compute f once per iteration instead of twice).
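A minimal golden-section sketch of that trick, reusing one interior point so only one new evaluation of f is needed per iteration:

def golden_section_minimize(f, a, d, tol=1e-8):
    # Keep four points a < b < c < d with b and c placed by the golden ratio, so
    # that after discarding one end the surviving interior point is already in
    # the right position and only one new evaluation of f is needed.
    inv_phi = (5 ** 0.5 - 1) / 2       # 1/phi, about 0.618
    b = d - inv_phi * (d - a)
    c = a + inv_phi * (d - a)
    fb, fc = f(b), f(c)
    while d - a > tol:
        if fb < fc:                    # minimum is in [a, c]
            d, c, fc = c, b, fb
            b = d - inv_phi * (d - a)
            fb = f(b)
        else:                          # minimum is in [b, d]
            a, b, fb = b, c, fc
            c = a + inv_phi * (d - a)
            fc = f(c)
    return (a + d) / 2

print(golden_section_minimize(lambda x: (x - 1.7) ** 2 + 3, 0.0, 10.0))  # ~1.7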
If you have an expression for the function, there are global optimization algorithms based on interval analysis.

finding smallest scale factor to get each number within one tenth of a whole number from a set of doubles

Suppose we have a set of doubles s, something like this:
1.11, 1.60, 5.30, 4.10, 4.05, 4.90, 4.89
We now want to find the smallest positive integer scale factor x such that every element of s multiplied by x is within one tenth of a whole number.
Sorry if this isn't very clear—please ask for clarification if needed.
Please limit answers to C-style languages or algorithmic pseudo-code.
Thanks!
You're looking for something called simultaneous Diophantine approximation. The usual statement is that you're given real numbers a_1, ..., a_n and a positive real epsilon and you want to find integers P_1, ..., P_n and Q so that |Q*a_j - P_j| < epsilon, hopefully with Q as small as possible.
This is a very well-studied problem with known algorithms. However, you should know that it is NP-hard to find the best approximation with Q < q where q is another part of the specification. To the best of my understanding, this is not relevant to your problem because you have a fixed epsilon and want the smallest Q, not the other way around.
One algorithm for the problem is the Lenstra–Lenstra–Lovász (LLL) lattice reduction algorithm. I wonder if I can find any good references for you. These class notes mention the problem and algorithm, but probably aren't of direct help. Wikipedia has a fairly detailed page on the algorithm, including a fairly large list of implementations.
To answer Vlad's modified question (if you want exact whole numbers after multiplication), the answer is known. If your numbers are rationals a1/b1, a2/b2, ..., aN/bN, with fractions reduced (ai and bi relatively prime), then the number you need to multiply by is the least common multiple of b1, ..., bN.
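A quick sketch of that, using Python's Fraction to recover each a_i/b_i in lowest terms; interpreting the doubles via limit_denominator (and the cap on the denominator) is an assumption on my part:

from fractions import Fraction
from math import lcm

def exact_scale_factor(values, max_denominator=1000):
    # Smallest positive integer x such that x * v is a whole number for every v,
    # after interpreting each double as a reduced fraction a_i/b_i.
    fracs = [Fraction(v).limit_denominator(max_denominator) for v in values]
    return lcm(*(f.denominator for f in fracs))

print(exact_scale_factor([1.11, 1.60, 5.30, 4.10, 4.05, 4.90, 4.89]))  # 100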
This is not a full answer, but some suggestions:
Note: I'm using "s" for the scale factor, and "x" for the doubles.
First of all, ask yourself if brute force doesn't work. E.g. try s = 1, then s = 2, then s = 3, and so forth.
We have a list of numbers x[i], and a tolerance t = 1/10. We want to find the smallest positive integer s, such that for each x[i], there is an integer q[i] such that |s * x[i] - q[i]| < t.
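In code, that brute force is only a few lines (the cap on s is just an arbitrary safety limit):

def smallest_scale(xs, t=0.1, s_max=10**6):
    # Smallest positive integer s such that s*x is within t of a whole number
    # for every x in xs; returns None if nothing up to s_max works.
    for s in range(1, s_max + 1):
        if all(abs(s * x - round(s * x)) <= t for x in xs):
            return s
    return None

print(smallest_scale([1.11, 1.60, 5.30, 4.10, 4.05, 4.90, 4.89]))  # 100 for the example set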
First note that if we can produce, for each x[i], an ordered list of the scale factors that work for it, it's simple enough to merge these lists to find the smallest s that will work for all of them. Secondly note that the answer depends only on the fractional part of x[i].
Rearranging the test above, we have |x - q/s| < t/s. That is, we want to find a "good" rational approximation for x, in the sense that the approximation should be better than t/s. Mathematicians have studied a variant of this where the criterion for "good" is that it has to be better than any with a smaller "s" value, and the best way to find these is through truncations of the continued fraction expansion.
Unfortunately, this isn't quite what you need, since once you get under your tolerance, you don't necessarily need to continue to get increasingly better -- the same tolerance will work. The next obvious thing is to use this to skip to the first number that would work, and do brute force from there. Unfortunately, for any number the largest the first s can be is 5, so that doesn't buy you all that much. However, this method will find you an s that works, just not the smallest one. Can we use this s to find a smaller one, if it exists? I don't know, but it'll set an upper limit for brute forcing.
Also, if you need the tolerance for each x to be < t, then this means the tolerance for the product of all x must be < t^n. This might let you skip forward a great deal, and set a reasonable lower limit for brute forcing.

How to choose group of numbers in the vector

I have an application with probabilities of measured features, and I want to select the n best features from a vector. I have a vector of real numbers, normalized so that the entries sum to 1 (each entry is the probability of some feature).
I want to select a group of the n largest numbers, where n is less than N (assume approximately 8). The selected numbers have to be close together without gaps, and they should also have a large sum (the sum of the remaining numbers should be several times lower).
Any ideas how to accomplish that?
I tried using the 80% quantile (but it is not sensitive to relatively large gaps, as in [0.2, 0.2, 0.01, 0.01, 0.001, 0.001, ...] with length ~100), and I tried a threshold between two successive numbers, but nothing worked very well.
I have a partial solution at the moment, but I am wondering whether there is some simple solution that I have overlooked.
John's answer is good. Also you might try
sort the probabilities
find the largest gap between successive probabilities
work up from there
From there, it's starting to sound like a pattern-recognition problem. My favorite method is Markov chain Monte Carlo (MCMC).
Edit: Since you clarified your question, my first thought is, since you only have 8 possible answers, develop a score for each one, based on how much probability it contains and whether or not it splits at a gap, and make a heuristic judgement.
Further edit: This sounds a bit like logistic regression. You want to find a value of P that effectively divides your set into members and non-members. For a given value of P, you can compute a log-likelihood for the ensemble, and choose P that maximizes that.
It sounds like you're wanting to select the n largest probabilities but the number n is flexible. If n were fixed, say n=10, you could just sort your vector and pull out the top 10 items. But from your example it sounds like you'd like to use a smaller value of n if there's a natural break in the data. Maybe you want to start with the largest probability and go down the list selecting items until the sum of the probabilities you pick crosses some threshold.
Maybe you have an implicit optimization problem where you want to maximize some probability with some penalty for large n. Try stating your problem that way. You might find your own answer, or you might be able to rephrase your question here in a way that helps other people give you a better answer.
I'm not really sure if this is what you want, but it seems you want to do the following.
Let's assume that the probabilities are x_1, ..., x_N in increasing order. Then you should try to find 1 <= i < j <= N such that the function
f(i,j) = (x_i + x_(i+1) + ... + x_j)/(x_j - x_i)
is maximized. This can be done naively in quadratic time.
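A direct quadratic-time sketch of that search, using prefix sums so each pair is scored in constant time (the example vector is a toy of my own):

def best_group(xs):
    # Return (i, j) maximizing (x_i + ... + x_j) / (x_j - x_i) over i < j,
    # for xs sorted in increasing order.
    prefix = [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
    best, best_pair = float("-inf"), None
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if xs[j] == xs[i]:
                continue  # avoid division by zero for tied values
            score = (prefix[j + 1] - prefix[i]) / (xs[j] - xs[i])
            if score > best:
                best, best_pair = score, (i, j)
    return best_pair

xs = sorted([0.001, 0.001, 0.01, 0.01, 0.2, 0.2, 0.578])  # sums to 1
print(best_group(xs))  # (4, 6): the group {0.2, 0.2, 0.578}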
