How to predict a True/False event - algorithm

I have some measurements of a system (x, y, z, ...) over many trials. The system produces a true or false output. I would like to take my data and produce a predictor function of x, y, z that best predicts the system outcome.
I am used to methods for approximating smooth outcomes, such as fitting a curve to a graph, but I don't know the terms to search for when the outcome is true/false.

Search for multivariate classification.
In your case you just have two classes (true and false).
The Wikipedia article on statistical classification has a list of commonly used algorithms.
You can also search for multivariate regression, which attempts to model a real value as a function of several variables; in your case the possible outcome values form a discrete set {0, 1}.
You would then have to decide whether the predicted outcome is True or False based on the regression function's output (e.g. assume True if the output is > 0.5 and False if it is <= 0.5).
Note that there is also https://stats.stackexchange.com/ where you could get more detailed answers related to the analysis of data.

You basically want the probability of TRUE or FALSE. A standard technique is logistic regression, which describes the relationship between a binary response variable and some independent variables. Since the output is a probability, it is easily interpretable.
There are standard libraries in most languages to implement logistic regression.
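As a rough illustration of what such a library does internally, here is a minimal sketch of logistic regression fitted by plain gradient descent, using only the Python standard library. The toy data, the function names (fit_logistic, predict_proba, predict), the learning rate and the epoch count are all made up for this example; in practice you would normally reach for an existing library instead.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n = len(rows)
    n_features = len(rows[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # derivative of the log-loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Estimated probability that the outcome is True.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def predict(w, b, x, threshold=0.5):
    # Turn the probability into a True/False decision.
    return predict_proba(w, b, x) > threshold

# Hypothetical measurements (x, y) and observed outcomes (1 = True, 0 = False).
rows = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.4), (0.3, 0.3),
        (0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (1.0, 1.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(rows, labels)
print(predict_proba(w, b, (0.95, 0.9)), predict(w, b, (0.95, 0.9)))
print(predict_proba(w, b, (0.05, 0.1)), predict(w, b, (0.05, 0.1)))
```

The 0.5 threshold in predict is only a default; if false positives and false negatives have different costs for your system, a different cut-off may serve you better.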

A neural network seems like a perfect fit for your problem.

Related

Obtaining the functional form of a curve

The following is the plot of a curve f(r), where r is the radial coordinate, and plotted for different values of a parameter as shown:
However, I don't know the functional form of the curve and would like to find it. Are there any numerical methods that can be used to find the functional form of f(r) in terms of the radial coordinate and the parameter?
I found a solution to the problem based on the suggestion by ja72 to use the Eureqa software, which churns through the data to create accurate predictive models using an evolutionary search algorithm.
In the question, the different curves correspond to different values of the parameter. So, initially I obtained the best-fit equation for different values of the parameter and found that the following model equation is suitable for my purpose:
Then I repeated the process for a large number of values of the parameter, calculated the values of the four functions for each of them, and then fitted these four functions individually. The following are the results that I obtained:
N.B.: Eureqa gave several other better fitting formulas than those mentioned in the answer. But the formulas that I mentioned are sufficiently accurate for my purpose and have minimum complexity.
A blind curve fit without an underlying model is a dangerous thing.
You need to have an understanding of the physical model behind the data to create a successful fit. The reason is that if r is a distance and the best-fit curve uses r^0.4072, for example, a dimensional quantity raised to a decimal power bears no meaning, and it hides any underlying assumptions, such as some other length l that is not included in the model, when only the dimensionless quantity (r/l) would make sense to raise to a decimal power.
From a function analysis standpoint
These curves are not the result of any standard math function. Admittedly, I am not that familiar with Bessel functions, gamma functions and Legendre polynomials, but none of the standard functions you find in a scientific calculator jumps out here.
If r is assumed to be dimensionless, then you try to match the asymptotic behavior when r -> 0 and when r -> ∞. That would be the baseline curve. To me it does not look hyperbolic, but rather close to 1/ln(1+r).
So change variables: let g = 1/ln(1+r), plot f(r) against g(r), and see what that looks like. Then try another round of curve fitting on the new curves, and so on.
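The change of variables can be sketched in a few lines of Python. Since the real data is not available here, the f values below are a synthetic stand-in generated from an assumed form 2/ln(1+r) + 0.5, and fit_line is a hypothetical helper; the point is only to show that if f is close to linear in g, a straight-line fit in the new variable recovers the relationship.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic stand-in for the measured curve (assumed form, not the real data).
r_values = [0.1 * k for k in range(1, 51)]
f_values = [2.0 / math.log(1.0 + r) + 0.5 for r in r_values]

# Change of variables: g = 1 / ln(1 + r), then fit f against g.
g_values = [1.0 / math.log(1.0 + r) for r in r_values]
slope, intercept = fit_line(g_values, f_values)
print(slope, intercept)
```

With real data, the residuals of this fit are what you would inspect next: any remaining systematic shape in them suggests the next transformation to try.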
Nobody can answer this question
Nobody else could effectively answer this question but you, because a) you have the data, and b) you need to make assumptions about what region is important or not, and what is acceptable deviation.

How to form precision-recall curve using one test dataset for my algorithm?

I'm working on a knowledge graph, more precisely in the field of natural language processing. To evaluate the components of my algorithm, I need to be able to distinguish the good candidates from the poor ones. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. Now I'm able to calculate:
Precision = X
Recall = Y
To establish a complete curve I need the rest of the points (X, Y). What should I do?
Build another dataset for testing?
Split my dataset?
Or is there another solution?
Neither of your two proposed methods. In short, a precision-recall or ROC curve is designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in the case of binary classification), you need a classifier that can provide a probability in the [0,1] range. The function for this in sklearn is precision_recall_curve; note how its 2nd parameter is called probas_pred.
To turn these probabilities into concrete class predictions, you can then set a threshold, say at 0.5. Setting such a threshold is problematic, however, since you can trade off precision against recall by varying the threshold, and an arbitrary choice can give a false impression of a classifier's performance. To circumvent this, threshold-independent measures like the area under the ROC or precision-recall curve are used. They create thresholds at different intervals, say 0.1, 0.2, 0.3, ..., 0.9, turn the probabilities into binary classes, and then compute precision and recall for each such threshold.
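To make the threshold sweep concrete, here is a small standard-library sketch. The labels and scores are invented for illustration, and precision_recall_at is a hypothetical helper, not a library function; it also adopts the convention that precision is 1.0 when nothing is predicted positive, which is an assumption of this sketch.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, y_true) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is predicted
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical manual labels (1 = good pair) and classifier scores in [0, 1].
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.70, 0.65, 0.40, 0.35, 0.30, 0.10]

# One (threshold, precision, recall) point per threshold: 0.1, 0.2, ..., 0.9.
thresholds = [k / 10 for k in range(1, 10)]
curve = [(t, *precision_recall_at(y_true, scores, t)) for t in thresholds]

for t, precision, recall in curve:
    print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Each printed row is one point of the curve, all computed from the same single test dataset.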

Evaluation in Elki

I know ELKI currently only includes unsupervised outlier detection methods, and therefore ELKI doesn't divide the input data into a training set and a test set.
But I've seen that evaluation is done over the minority class when available. I would like to know:
Does ELKI use all input data for evaluation?
Does the runtime take evaluation into account, or just training time?
Does evaluation take outlier scores into account to estimate the false positive rate and true positive rate in order to evaluate rankings?
In the LOF algorithm, for example, suppose an instance in the normal class has a high LOF score. Will it be considered a false positive or a true positive in evaluation?
Thanks!
Yes, all input is used for unsupervised methods.
The labels must not have been used for running the algorithm, they are only used at evaluation time.
Runtime is reported separately for every algorithm.
This depends on your evaluation. Most measures (e.g. ROC AUC) will only take the ranking into account. To evaluate the actual scores, you first need to normalize them. For a measure that takes (normalized) scores into account, please see
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel. On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA: 1047–1058, 2012.
True positives and false positives require a binary decision. See ROC AUC for an approach that does not require specifying a threshold to make the decision binary, but instead evaluates all possible thresholds.
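As an illustration of why ROC AUC depends only on the ranking, here is a minimal sketch in plain Python. It is not ELKI's implementation; the labels and LOF-like scores are invented, and roc_auc is a hypothetical helper that computes the AUC as the fraction of (outlier, inlier) pairs in which the outlier gets the higher score.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random labeled outlier outranks
    a random inlier; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = labeled outlier (minority class), 0 = normal.
labels = [0, 0, 1, 0, 1, 0]
lof_scores = [1.0, 1.1, 3.2, 2.5, 1.8, 0.9]

auc = roc_auc(labels, lof_scores)
print(auc)  # 0.875

# Rescaling the scores keeps the ranking, so the AUC is unchanged.
print(roc_auc(labels, [10 * s for s in lof_scores]))  # 0.875
```

The normal instance with score 2.5 outranks the outlier with score 1.8, which is exactly what costs this ranking one of its eight pairs. No threshold was needed to account for it.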

Why overfitting gives a bad hypothesis function

In linear or logistic regression, if we find a hypothesis function that fits the training set perfectly, then that should be a good thing, because in that case we have used 100% of the given information to predict new information.
Yet this is called overfitting and is said to be a bad thing.
By making the hypothesis function simpler we may actually be increasing the noise instead of decreasing it.
Why is it so?
Overfitting occurs when you try "too hard" to make the examples in the training set fit the classification rule.
It is considered a bad thing for two main reasons:
The data might have noise. Trying too hard to classify 100% of the examples correctly will make the noise count and give you a bad rule, while ignoring this noise would usually be much better.
Remember that the classified training set is just a sample of the real data. A solution that fits every sample is usually more complex than what you would have got if you had tolerated a few wrongly classified samples. According to Occam's Razor, you should prefer the simpler solution, so ignoring some of the samples will be better.
Example:
According to Occam's razor, you should tolerate the misclassified sample, and assume it is noise or insignificant, and adopt the simple solution (green line) in this data set:
Because you actually didn't "learn" anything from your training set, you've just fitted to your data.
Imagine you have a one-dimensional regression
x_1 -> y_1
...
x_n -> y_n
The function defined this way
f(x) = y_i, if x = x_i for some i in 1..n
f(x) = 0, otherwise
will give you a perfect fit, but it's actually useless.
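In code, such a "perfect" function is nothing more than a lookup table. The sketch below uses invented training data that roughly follows y = 2x; make_memorizer is a hypothetical name.

```python
def make_memorizer(xs, ys):
    """Return f with f(x_i) = y_i on the training points and 0 everywhere else."""
    table = dict(zip(xs, ys))

    def f(x):
        return table.get(x, 0.0)

    return f

# Hypothetical training set, roughly y = 2x with some noise.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.1, 2.1, 3.9, 6.2]

f = make_memorizer(train_x, train_y)

# Zero error on the training set...
train_error = sum((f(x) - y) ** 2 for x, y in zip(train_x, train_y))
print(train_error)  # 0.0

# ...but no information at all about a point it has not seen.
print(f(1.5))  # 0.0, although the trend suggests a value near 3
```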
Hope, this helped a bit:)
Assuming that your regression accounts for all sources of deviation in your data, then you might argue that your regression perfectly fits the data. However, if you know all (and I mean all) of the influences in your system, then you probably don't need a regression. You likely have an analytic solution that perfectly predicts new information.
In actuality, the information you possess will fall short of this perfect level. Noise (measurement error, partial observability, etc) will cause deviation in your data. In response, a regression (or other fitting mechanism) should seek the general trend of the data while minimizing the influence of noise.
Actually, the statement is not quite correct as written. It is perfectly fine to match 100% of your data if your hypothesis function is linear. Every continuous nonlinear function may be approximated locally by a linear function, which gives important information on its local behavior.
It is also fine to match 100 points of data to a quadratic curve if that data matches 100%. You can have high confidence that you are not overfitting your data, since the data consistently shows quadratic behavior.
However, one can always get 100% fit by using a polynomial function of high enough degree. Even without the noise that others have pointed out, though, you shouldn't assume your data has some high degree polynomial behavior without having some kind of theoretical or experimental confirmation of that hypothesis. Two good indicators that polynomial behavior is indicated are:
You have some theoretical reason for expecting the data to grow as x^n in one of the directional limits.
You have data that has been supporting a fixed degree polynomial fit as more and more data has been collected.
Notice, though, that even though exponential and reciprocal relationships may have data that fits a polynomial of high enough degree, they don't tend to obey either of the two conditions above.
The point is that your data fit needs to be useful to prediction. You always know that a linear fit will give information locally, but that information becomes more useful the more points are fit. Even if there are only two points and noise, a linear fit still gives the best theoretical look at the data collected so far, and establishes the first expectations of the data. Beyond that, though, using a quadratic fit for three points or a cubic fit for four is not validly giving more information, as it assumes both local and asymptotic behavior information with the addition of one point. You need justification for your hypothesis function. That justification can come from more points or from theory.
(A third reason that sometimes comes up is:
You have theoretical and experimental reason to believe that error and noise do not contribute more than some bounds, and you can take a polynomial hypothesis to look at local derivatives and the behavior needed to match the data.
This is typically used in understanding data to build theoretical models without having a good starting point for theory. You should still strive to use the smallest polynomial degree possible, and look to substitute out patterns in the coefficients with what they may indicate (reciprocal, exponential, gaussian, etc.) in infinite series.)
Try imagining it this way. You have a function from which you pick n different values to represent a sample / training set:
y(n) = x(n), n is element of [0, 1]
But, since you want to build a robust model, you want to add a little noise to your training set, so you actually add a little noise when generating the data:
data(n) = y(n) + noise(n) = x(n) + u(n)
where u(n) denotes random noise with mean 0 and standard deviation 1, drawn from a normal distribution N(0,1). Quite simply, it's a noise signal that is most likely to take a value near 0, and less likely to take a value the farther it is from 0.
And then you draw, let's say, 10 points to be your training set. If there was no noise, they would all lie on the line y = x. Since there was noise, the lowest-degree polynomial that can pass through all of them is probably of 9th order, a function like: y = a_9 * x^9 + a_8 * x^8 + ... + a_1 * x + a_0.
If you instead used only an estimate of the information in the training set, you would probably get a simpler function than the 9th-order polynomial, and it would be closer to the real function.
Consider further that your real function can have values outside the [0, 1] interval but for some reason the samples for the training set could only be collected from this interval. Now, a simple estimation would probably act significantly better outside the interval of the training set, while if we were to fit the training set perfectly, we would get an overfitted function that meandered with lots of ups and downs all over :)
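A small experiment along these lines, as a sketch only: 10 equally spaced points on [0, 1] taken from y = x, with a fixed alternating perturbation of ±0.05 standing in for the random noise so that the result is reproducible. One model is the polynomial that passes exactly through all 10 points (built here by Lagrange interpolation), the other is a least-squares straight line. Both helper functions are hypothetical names written for this example.

```python
def lagrange_interpolant(xs, ys):
    """Polynomial that passes exactly through every (x, y) training point."""
    def p(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            basis = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    basis *= (x - xm) / (xj - xm)
            total += yj * basis
        return total
    return p

def fit_line(xs, ys):
    """Least-squares straight line, returned as a callable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# Training set: y = x on [0, 1] plus a fixed alternating perturbation
# (a deterministic stand-in for noise, assumed for reproducibility).
xs = [k / 9 for k in range(10)]
ys = [x + 0.05 * (-1) ** k for k, x in enumerate(xs)]

perfect_fit = lagrange_interpolant(xs, ys)
simple_fit = fit_line(xs, ys)

# Evaluate both outside the training interval, where the true value is 1.5.
x_new = 1.5
print("perfect fit:", perfect_fit(x_new))
print("simple fit: ", simple_fit(x_new))
```

The interpolating polynomial has zero training error, yet at x = 1.5 it is off by thousands, while the straight line stays within a few hundredths of the true value.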
Overfitting is considered bad because of the deviation it has from the true solution. An overfit solution fits the training data 100%, but the model changes drastically when even a few data points are added. This is called the variance of the model. Hence the bias-variance tradeoff, where we try to balance both factors so that the model does not change drastically on small changes in the data but still predicts the output reasonably well.

Is there a special type of multivariate regression for multiple-parameter predictions?

I am trying to use multivariate regression to play basketball. Specifically, I need to predict the pitch, yaw, and cannon strength based on X, Y, and distance from the target. I was thinking of using multivariate regression with multiple variables for each of the output parameters. Is there a better way to do this?
Also, should I solve directly for the best fit, or use gradient descent?
ElKamina's answer is correct, but one thing to note about it is that it is identical to doing k independent ordinary least squares regressions. That is, it is the same as doing a separate linear regression from X to pitch, from X to yaw, and from X to strength. This means you are not taking advantage of correlations between the output variables. This may be fine for your application, but one alternative that does take advantage of correlations in the output is reduced rank regression (MATLAB implementations are available), or, somewhat related, you can explicitly decorrelate y by projecting it onto its principal components (see PCA, also called PCA whitening in this case since you aren't reducing the dimensionality).
I highly recommend chapter 6 of Izenman's textbook "Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning" for a fairly high level overview of these techniques. If you're at a university it may be available online through your library.
If those alternatives don't perform well, there are many sophisticated non-linear regression methods that have multiple output versions (although most software packages don't have the multivariate modifications) such as support vector regression, Gaussian process regression, decision tree regression, or even neural networks.
Multivariate regression amounts to inverting the covariance matrix of the input variables. Since there are many ways to invert that matrix (as long as the dimensionality is not very high; a thousand should be okay), you should solve directly for the best fit instead of using gradient descent.
Let n be the number of samples, m the number of input variables and k the number of output variables.
Let X be the input data (n,m),
Y be the target data (n,k), and
A be the coefficients you want to estimate (m,k).
XA = Y
X'XA = X'Y
A = inverse(X'X)X'Y
X' is the transpose of X.
As you can see, once you find the inverse of X'X you can calculate the coefficients for any number of output variables with just a couple of matrix multiplications.
Use any simple math tool to solve this (MATLAB/R/Python...).
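For reference, the normal equations above can be sketched in plain Python without any external library. The data below is invented (a constant column plus two inputs, and two outputs that are exact linear functions of them), and the helper names are hypothetical. Instead of forming inverse(X'X) explicitly, the sketch solves the linear system X'X A = X'Y by Gauss-Jordan elimination, which gives the same A.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(M, B):
    """Solve M Z = B for Z by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(m_row) + list(b_row) for m_row, b_row in zip(M, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; the inputs are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def least_squares(X, Y):
    """Coefficients A (m,k) from the normal equations X'X A = X'Y."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, Y))

# Hypothetical data: n = 5 samples, m = 3 inputs (first column is a constant 1),
# k = 2 outputs generated from known coefficients.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
A_true = [[1.0, 0.0], [2.0, -1.0], [0.5, 3.0]]
Y = matmul(X, A_true)

A = least_squares(X, Y)
for row in A:
    print([round(v, 6) for v in row])
```

Both output columns are estimated in one solve, which is the "couple of matrix multiplications" point made above; each column of A is also what a separate regression on that output alone would give.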

Resources