Algorithm to choose multiple discrete parameters based on input vector

I am faced with the following problem: Given a point in k-dimensional space, choose a set of discrete parameters to maximize the probability of a positive (binary) outcome. I have training examples in the same form, for example
   point          parameters    good?
   -------------  ------------  -----
1) x1 x2 x3       p1 p2 p3      NO
2) x1 x2 x3       p1 p2 p3      YES
3) x1 x2 x3       p1 p2 p3      YES
...etc.
All parameters are free variables, and there is an arbitrary number of them (k is also arbitrary). I have considered:
Generate a clustering of the points, tune the parameters for each cluster, and then associate each new point with a cluster.
Develop a model to predict each parameter separately.
Both have major drawbacks. I was wondering if there is a more systematic approach to this (it seems like a common enough problem). Can anyone point me towards some relevant reading or an algorithm?
Thanks, and I apologize in advance if this is the wrong place to ask these kinds of questions.

This is a classic classification (data mining) problem and it's up to you to pick which algorithm to use. The most common approaches are:
KNN (k-nearest-neighbor)
Bayes classifier
SVM (support vector machine)
Decision trees
You should read up on them and decide which one is best for your problem; unfortunately there is no 'best' approach for all domains and data.

Another simple technique you haven't mentioned is k-nearest neighbours - find the nearest positive point in k-dimensional space to your input point and copy its choice of parameters.
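A minimal sketch of that idea in Python, assuming the training data lives in NumPy arrays (the function and array names here are mine, not from the question):

import numpy as np

def copy_nearest_positive_params(query, X, params, good):
    """X: (n, k) training points, params: (n, m) chosen parameters, good: (n,) booleans."""
    pos = np.flatnonzero(good)                      # indices of positive examples
    dists = np.linalg.norm(X[pos] - query, axis=1)  # Euclidean distance in k-dimensional space
    nearest = pos[np.argmin(dists)]                 # closest positive training point
    return params[nearest]                          # reuse its parameter settings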
If you knew or could find out more about what the k-dimensional space or parameters actually mean, you might be able to use this knowledge to construct a good model.

Related

Interpolation algorithm

I have a question regarding how to do interpolation in the following case:
There are basically two sets of data, "o" and "*". In any given case one of them is known, and I am trying to get the other by interpolation. There are some assumptions/conditions listed below:
p1, p2, p3....are the positions, p12, p23 are the values for the intervals that hold them. Same for d1, d2, d3 and d12, d23.
both o and * are distributed along the common axis (the x axis in this case)
both o and * are equidistantly distributed, meaning
p2-p1 = p3-p2 = .....
and
d2-d1 = d3-d2 = .......
all positions (p1, p2, p3, ..., d1, d2, d3, ...) are known; one set of values is known (e.g. p12 and p23), the other is unknown (e.g. d12 and d23).
One example:
If p12 and p23 are known and we want to calculate d23, d34 and d45, we simply take the contribution of each known value into the other data set, weighted by the overlapping length.
I am just wondering: in the computer science sense, is there an efficient interpolation algorithm for this particular setup? My intuition is that because all the data are distributed equidistantly, there should be some sort of simplification/acceleration that can be done. Or can anyone point out a way so I can do some literature reading? Thanks a lot.
What you're trying to do is take a known set of points, use that to interpolate a function, and then evaluate that interpolated function at another set of points.
This is a huge topic. You can develop your function to be piecewise linear, piecewise polynomial, a Fourier series, or built with wavelet algorithms; it all comes down to what kind of underlying function you think you are trying to represent. And that depends on your underlying problem.
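For instance, the piecewise-linear case is a one-liner with NumPy. This is only a sketch with made-up numbers, and it assumes plain point values rather than interval averages:

import numpy as np

p = np.array([0.0, 1.0, 2.0, 3.0])        # positions where values are known
p_vals = np.array([1.0, 3.0, 2.0, 4.0])   # known values at those positions
d = np.array([0.5, 1.5, 2.5])             # positions where values are wanted

d_vals = np.interp(d, p, p_vals)          # evaluate the piecewise-linear interpolant at d
print(d_vals)                             # prints [2.  2.5 3. ]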

Should k-means input contain unique values or all values (repeated as well)?

I am clustering my single-dimensional data with a k-means implementation. Although there are methods like Jenks breaks and Fisher's natural breaks for single-dimensional data, I still chose to go with k-means.
My question is: what difference does it make if I only cluster the unique values in my list of data points, or if I use all data points (with repetition)?
What is advisable?
This can certainly make a difference: the mean of [-1 -1 1] is -0.33, while the mean of [-1 1] is 0. What you should do depends on the data and what you want to do with the result of clustering. As a default, though, I'd say keep them: removing points changes the local densities that k-means is designed to pick up for cluster centers, and besides, why would you remove duplicates but not near-duplicates?
k-means is an optimization method which minimizes the distortion of an assignment of your data points into clusters. The distortion is the sum of the within-cluster sums of squares. Or, if L is the set of labels, P the set of points, "has" indicates that a point has a particular label, and d is the distance between points, then
distortion = sum [ d(p1, p2)^2 | p1 <- P
                               , p2 <- P
                               , l  <- L
                               , p1 has l and p2 has l
                               ]
We can study the result of a successful k-means optimization by talking about this distortion. For instance, given any two points on top of one another we have the distance between them d(p1, p2) = 0 and so if they're in the same cluster then they are increasing the distortion by nothing at all. So, somewhat obviously, a good clustering will always have all point duplicates in the same cluster.
Now consider a set of 3 points like this
   A          ?          B
---p----------q----------r---
In other words, three equidistant points: the two on the outside have different labels and the one in the middle has an unknown label. The distances (measured in dashes) are d(p,q) = 10 = d(q,r), so if we label q as A we increase our distortion by 100, and the same if we label it B.
If we change this situation slightly by replicating the point p as a new point s (shown below), we've not increased the distortion at all (since d(p,s) = 0), but if we label q as A then we'll increase the distortion by d(p,q)^2 + d(s,q)^2 = 100 + 100 = 200, while if we label q as B the distortion increases only by d(q,r)^2 = 100.
   A          ?          B
---p----------q----------r---
   s
So this replication has repulsed q away from label A.
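Here is a tiny script (my own illustration) that reproduces those numbers, taking the distortion as the sum of squared pairwise distances over unordered pairs within each cluster:

from itertools import combinations

def distortion(points, labels):
    return sum((points[i] - points[j]) ** 2
               for i, j in combinations(range(len(points)), 2)
               if labels[i] == labels[j])

p, q, r = 0, 10, 20   # positions on the line, 10 dashes apart

# Without the duplicate, labeling q as A or as B costs the same.
print(distortion([p, q, r], "AAB"), distortion([p, q, r], "ABB"))          # 100 100

# With a duplicate s of p, labeling q as A costs twice as much as labeling it B.
print(distortion([p, p, q, r], "AAAB"), distortion([p, p, q, r], "AABB"))  # 200 100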
Now if you play around with k-means for a bit, you might be surprised by the analysis above: it turns out that adding a whole lot of replications of a single point won't really produce the linearly scaling impact it seems like it ought to.
This is because actual optimization of that metric is known to be NP-hard in almost any circumstance. If you truly want to optimize it and have n points with K labels, then your best bet is to check all K^n labelings. Thus most k-means algorithms are approximate, and you suffer some search error between the true optimum and the result of your algorithm.
For k-means, this will happen especially when there are lots of replicated points, as these "replicated pools" still grab points according to their distance from the centroid, not actually due to their global minimization properties.
Finally, when talking about replication in machine learning algorithms it's worth noting that most machine learning algorithms are based on assumptions about data which actively preclude the idea of replicated data points. This is known broadly as "general position" and many proofs begin by assuming your data is in "general position".
The idea is that if your points are truly distributed in R^n then there's 0 probability that two points will be identical under any of the probability distributions which are "nice" enough to build algorithms atop.
What this generally means is that if you have data with a lot of replicated points, you should consider the impact of a small "smoothing" step prior to analysis. If perturbing all of your points by a small normally distributed jump does not affect the meaning of the data... then you're probably quite OK running normal ML algorithms that anticipate the data living in R^n. If not, then you should consider algorithms which better respect the structure of your data—perhaps it's better to see your data as a tree and run an algorithm for ML atop structured data.
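The smoothing step itself can be as small as this (my own sketch; the scale is a judgment call that depends on your data's units):

import numpy as np

def jitter(points, scale=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    return points + rng.normal(scale=scale, size=points.shape)  # small Gaussian perturbation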

How to implement a superoptimizer

[Related to https://codegolf.stackexchange.com/questions/12664/implement-superoptimizer-for-addition from Sep 27, 2013]
I am interested in how to write superoptimizers, in particular to find small logical formulae for sums of bits. This was previously set as a challenge on codegolf, but it seems a lot harder than one might imagine.
I would like to write code that finds the smallest possible propositional logical formula to check if the sum of y binary 0/1 variables equals some value x. Let us call the variables x1, x2, x3, x4 etc. In the simplest approach the logical formula should be equivalent to the sum. That is, the logical formula should be true if and only if the sum equals x.
Here is a naive way to do that. Say y = 15 and x = 5. Pick all 3003 different ways of choosing 5 variables out of the 15, and for each make a new clause that is the AND of those variables ANDed with the negations of the remaining variables. You end up with 3003 clauses, each of length exactly 15, for a total cost of 45045.
However, if you are allowed to introduce new variables into your solution then you can potentially reduce this a lot by eliminating common subformulae. So in this case your logical formula consists of the y binary variables, x and some new variables. The whole formula would be satisfiable if and only if the sum of the y variables equals x. The only allowed operators are and, or and not.
It turns out there is a clever method for solving this problem when x = 1, at least in theory. However, I am looking for a computationally intensive method to search for small solutions.
How can you make a superoptimizer for this problem?
Examples. Take as an example two variables where we want a logical formula that is True exactly when they sum to 1. One possible answer is:
(((not y0) and (y1)) or ((y0) and (not y1)))
To introduce a new variable such as z0 to represent (y0 and (not y1)), we can introduce a new clause ((y0 and (not y1)) or (not z0)) and replace (y0 and (not y1)) by z0 throughout the rest of the formula. Of course this is pointless in this example, as it makes the expression longer.
Write your desired sum in binary. First look at the least significant bit, y0. Clearly,
x1 xor x2 xor ... xor xn = y0 - that's your first formula. The final formula will be a conjunction of formulae for each bit of the desired sum.
Now, do you know how an adder is implemented? http://en.wikipedia.org/wiki/Adder_(electronics) . Take inspiration from it, group your input into pairs/triples of bits, calculate the carry bits, and use them to make formulae for y1...yk . If you need further hints, let me know.
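For what it's worth, here is a small sketch of that idea using only and/or/xor gates. It is my own illustration, not a minimal formula: it adds the inputs into a running binary count one at a time rather than grouping them into triples, and checks the result by brute force.

from itertools import product

def full_adder(a, b, cin):
    s = a ^ b ^ cin                    # sum bit
    cout = (a & b) | (cin & (a ^ b))   # carry bit
    return s, cout

def popcount_bits(bits, width):
    """Binary count of the set input bits, least significant bit first."""
    count = [0] * width
    for b in bits:
        carry = b
        for i in range(width):
            count[i], carry = full_adder(count[i], 0, carry)  # ripple the carry upward
    return count

# Verify against ordinary arithmetic for all 4-bit inputs.
for bits in product([0, 1], repeat=4):
    count = popcount_bits(bits, width=3)
    assert sum(bits) == sum(c << i for i, c in enumerate(count))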
If I understand what you're asking, you'll want to look into the general topics of logic minimization and/or Boolean function simplification. The references are mostly about general methods for eliminating redundancy in Boolean formulas that are disjunctions ("or"s) of terms that are conjunctions ("and"s).
By hand, the standard method is called a Karnaugh map. The equivalent algorithm expressed in a way that's more amenable to computer implementation is Quine-McCluskey (also called the method of prime implicants). The minimization problem is NP-hard, and QM solves it exactly.
Therefore I think QM is what you want for the "super-optimizer" you're trying to build.
But the combination of NP-hardness and exact solution means that QM is impractical for large problems, or even moderately non-trivial ones.
The QM Algorithm lays out the conjunctive terms (called minterms in this context) in a table and conducts searches for 1-bit differences between pairs of terms. These terms can be combined and the factor for the differing bit labeled "don't care" in further combinations. This is repeated with 2-bit, 4-bit, etc. subsets of bits. The exponential behavior results because choices are involved for the combinations of larger bit sets: choosing one rules out another. Therefore it is essentially a search problem.
There is an enormous literature on heuristics to trim the search space, yet find "good" solutions that aren't necessarily optimal. A famous one is Espresso. However, since algorithm improvements translate directly to dollars in semiconductor manufacture, it's entirely possible that the best are proprietary and closely held.
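If you just want to experiment in code, SymPy ships a SOPform helper which, as I understand it, performs a Quine-McCluskey style minimization. A small sketch on the three-variable versions of this problem:

from itertools import product
from sympy import symbols
from sympy.logic import SOPform

y0, y1, y2 = symbols("y0 y1 y2")

# Minterms where exactly one of the three inputs is 1: this one is already minimal.
exactly_one = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(SOPform([y0, y1, y2], exactly_one))

# Minterms where at least one input is 1: this collapses to y0 | y1 | y2.
at_least_one = [list(m) for m in product([0, 1], repeat=3) if any(m)]
print(SOPform([y0, y1, y2], at_least_one))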

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate in running competitions of 10 km and 20 km with hilly courses, i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
        Comp1   Comp2   Comp3
User1   20min   ??      10min
User2   25min   20min   12min
User3   30min   25min   ??
User4   30min   ??      ??
I would like to complete the matrix above, which has size 1000x20 and a sparseness of 8% (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of the distributions). Moreover, the correlations between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets you most of the way to the solution. I'll codify your intuition that the combination of ability of a user and difficulty of the course yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users, so that u_i is user i's speed. Let the vector v denote the difficulty of the courses, so that v_j is course j's difficulty. Also, when available, let t_ij be user i's time on course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It minimizes the above cost function by gradient descent. The gradients of f with respect to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a conjugate gradient solver or BFGS optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
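A sketch of how that might look with SciPy's modern minimize interface instead of the older fmin_* functions (my own code; the array names and shapes are assumptions, and missing entries of y are marked by a boolean mask):

import numpy as np
from scipy.optimize import minimize

def fit_speed_model(y, mask, n_users, n_courses, seed=0):
    """y: (n_users, n_courses) observed speeds, mask: True where an entry is observed."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(n_users + n_courses)

    def cost_and_grad(x):
        u, v = x[:n_users], x[n_users:]
        resid = np.where(mask, np.outer(u, v) - y, 0.0)  # residuals on observed entries only
        f = np.sum(resid ** 2)
        du = 2.0 * resid @ v       # df/du_i = 2 * sum_j (u_i v_j - y_ij) v_j
        dv = 2.0 * resid.T @ u     # df/dv_j = 2 * sum_i (u_i v_j - y_ij) u_i
        return f, np.concatenate([du, dv])

    res = minimize(cost_and_grad, x0, jac=True, method="L-BFGS-B")
    u, v = res.x[:n_users], res.x[n_users:]
    return u, v, np.outer(u, v)    # np.outer(u, v) is the completed speed matrix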
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been proposed. The resulting algorithms are just as simple to code up and seem to work very well. Check out, for example, Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this; perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step, normalize your data to zero mean and unit standard deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the standard deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is your loss function is:
total_error = 0
forall i, j
    if (Participant i participated in Race j)
        actual    = ActualRaceResult(i, j)
        predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
        total_error += (actual - predicted)^2
So calculate the partial derivatives of this function with respect to the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term in the loss function, for example the squared lengths of the feature vectors.)
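A rough sketch of one alternating update under the assumptions above (my own code; ParticipantFeatures and RaceFeatures become the rows of P and R, and mask marks the known results):

import numpy as np

def alternating_step(P, R, results, mask, lr=0.01, reg=0.1):
    """P: (participants, N), R: (races, N); results holds known times, mask marks them."""
    resid = np.where(mask, P @ R.T - results, 0.0)    # prediction error on known entries only
    P = P - lr * (2.0 * resid @ R + 2.0 * reg * P)    # improve participant vectors
    resid = np.where(mask, P @ R.T - results, 0.0)
    R = R - lr * (2.0 * resid.T @ P + 2.0 * reg * R)  # then improve race vectors
    return P, R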
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classic missing-data recovery task. There exist several different methods; one I can suggest is based on the Self-Organizing Feature Map (Kohonen map).
Below it is assumed that every athlete's record is a pattern and every competition is a feature.
Basically, you should divide your data into 2 sets: first, the fully defined patterns, and second, the patterns with partially missing features. I assume this is acceptable because the sparsity is 8%, that is, you have enough data (92%) to train the net on undamaged records.
Then you feed the first set to the SOM and train it on this data. During this process all features are used. I will not copy the algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate the best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU the weights corresponding to the missing features.
As an alternative, you could avoid dividing the data into 2 sets and instead train the net on all patterns, including the ones with missing features. But for such patterns the learning process should be altered in a similar way, that is, the BMU should be calculated only on the features present in each pattern.
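A plain-NumPy sketch of the imputation step described above (my own illustration; it assumes the trained weights are available as a units-by-features array and missing features are NaN):

import numpy as np

def impute_from_bmu(weights, pattern):
    """weights: (n_units, n_features) trained SOM codebook; pattern: 1-D with NaN for missing."""
    known = ~np.isnan(pattern)
    dists = np.linalg.norm(weights[:, known] - pattern[known], axis=1)  # distance on known features only
    bmu = weights[np.argmin(dists)]                 # best matching unit
    return np.where(known, pattern, bmu)            # copy missing features from the BMU's weights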
I think you can have a look at recent low-rank matrix completion methods.
The assumption is that your matrix has low rank compared to the matrix dimensions.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This formulation minimizes the rank of your matrix M. P in the constraint is an operator that takes the known entries of your matrix M' and constrains those entries in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex envelope ||M||_*. You then trade off the two terms by controlling the parameter lambda.
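One common iterative scheme for the relaxed problem is singular-value thresholding (soft-impute style). A rough sketch, with lam and the iteration count as knobs you would tune:

import numpy as np

def soft_impute(M_obs, mask, lam=1.0, n_iters=100):
    """M_obs: matrix with the known entries filled in, mask: True where an entry is known."""
    M = np.where(mask, M_obs, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s = np.maximum(s - lam, 0.0)        # shrink singular values (proximal step for the nuclear norm)
        M = (U * s) @ Vt                    # low-rank reconstruction
        M = np.where(mask, M_obs, M)        # keep the known entries fixed
    return M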

Trilateration of a signal using Time Difference of Arrival

I am having some trouble finding or implementing an algorithm to locate a signal source. The objective of my work is to find the position of a sound emitter.
To accomplish this I am using three microphones. The technique I am using is multilateration, which is based on the time difference of arrival (TDOA).
The time differences of arrival between the microphones are found using cross-correlation of the received signals.
I have already implemented the algorithm to find the time difference of arrival, but my problem is more with how multilateration itself works; it's unclear to me from my reference, and I couldn't find any other good references for this that are free/open.
If you have some references on how I can implement a multilateration algorithm, or some other trilateration algorithm based on time difference of arrival that I can use, it would be a great help.
Thanks in advance.
The point you are looking for is the intersection of three hyperbolas. I am assuming 2D here since you only use 3 receivers. Technically, you can find a unique 3D solution, but as you likely have noise, I assume that if you wanted a 3D result you would have taken 4 microphones (or more).
The wikipedia page makes some computations for you. They do it in 3D; you just have to set z = 0 and solve the system of equations (7).
The system is overdetermined, so you will want to solve it in the least-squares sense (this is the point of using 3 receivers, actually).
I can help you with multi-lateration in general.
Basically, if you want a solution in 3D you need at least 4 points and 4 distances from them (2 give you a circle in which the solution lies, because that is the intersection of 2 spheres; 3 points give you 2 possible solutions, the intersection of 3 spheres; so in order to have one solution you need 4 spheres). So, when you have some points (4+) and the distances to them (there is an easy way to transform the TDOA into a set of equations involving only length-type distances, not times), you need a way to solve the set of equations. First you need a cost function (or solution-error function, as I call it), which would be something like
err(x,y,z) = sum(i=1..n){sqrt[(x-xi)^2 + (y-yi)^2 + (z-zi)^2] - di}
where x, y, z are the coordinates of the current point in the numerical solution and xi, yi, zi and di are the coordinates of, and distance to, the i-th reference point. In order to solve this, my advice is NOT to use Gauss-Newton or Newton methods: you need the first and second derivatives of the aforementioned function, and those have finite discontinuities at some points in space, hence it is not a smooth function and these methods won't work. What will work is the direct-search family of function-optimization algorithms (finding minima and maxima; in our case, you need the minimum of the error/cost function).
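For example, with SciPy you could hand the cost function to a direct-search method such as Nelder-Mead. This is only a sketch with made-up receiver positions and ideal distances, and I sum absolute residuals so that the minimum sits at zero:

import numpy as np
from scipy.optimize import minimize

def err(pos, refs, dists):
    """refs: (n, 3) reference-point coordinates, dists: (n,) measured distances."""
    return np.sum(np.abs(np.linalg.norm(refs - pos, axis=1) - dists))

refs = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0],
                 [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]])  # made-up receiver positions
true_pos = np.array([3.0, 4.0, 5.0])
dists = np.linalg.norm(refs - true_pos, axis=1)        # ideal, noise-free distances

res = minimize(err, x0=np.zeros(3), args=(refs, dists), method="Nelder-Mead")
print(res.x)   # should land close to true_pos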
That should help anyone wanting to find a solution for similar problem.

Resources