I know ELKI currently includes only unsupervised outlier detection methods, so ELKI doesn't divide the input data into a training set and a test set.
But I've seen that evaluation is done over the minority class when available. I would like to know:
Does ELKI use all input data for evaluation?
Does the reported runtime include evaluation, or just training time?
Does evaluation take outlier scores into account to estimate the false positive rate and true positive rate in order to evaluate rankings?
In the LOF algorithm, for example, suppose an instance in the normal class has a high LOF score. Will it be considered a false positive or a true positive in evaluation?
Thanks!
Yes, all input is used for unsupervised methods.
The labels must not be used when running the algorithm; they are used only at evaluation time.
Runtime is reported separately for each algorithm.
This depends on your evaluation. Most measures (e.g. ROC AUC) will only take the ranking into account. To evaluate the actual scores, you first need to normalize them. For a measure that takes (normalized) scores into account, please see
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel. On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA: 1047–1058, 2012.
True positives and false positives require a binary decision. See ROC AUC for an approach that does not require specifying a threshold to make the decision binary, but instead evaluates all possible thresholds.
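To make the ranking-based view concrete, here is a minimal sketch in plain Python (not ELKI's implementation). It computes ROC AUC directly from scores and labels, and counts true and false positives at a chosen threshold. The scores and labels are made-up illustration data, and the helper names roc_auc and confusion_at are hypothetical. Under any threshold below its score, a normal instance with a high score is counted as a false positive.

```python
def roc_auc(scores, is_outlier):
    """ROC AUC from scores alone: the fraction of (outlier, inlier) pairs
    in which the outlier gets the higher score (ties count as half)."""
    pos = [s for s, y in zip(scores, is_outlier) if y]
    neg = [s for s, y in zip(scores, is_outlier) if not y]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_at(scores, is_outlier, threshold):
    """Return (TP, FP) when every instance scoring >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, is_outlier) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, is_outlier) if s >= threshold and not y)
    return tp, fp


# Hypothetical LOF-style scores; the labels are used only for evaluation.
scores = [3.1, 2.4, 1.9, 1.1, 1.0, 0.9]
labels = [True, False, True, False, False, False]

auc = roc_auc(scores, labels)
# The normal instance scoring 2.4 is a false positive at threshold 2.0,
# but it is not flagged at all at threshold 2.5.
flagged_low = confusion_at(scores, labels, 2.0)
flagged_high = confusion_at(scores, labels, 2.5)
```

Only the order of the scores matters for roc_auc, which is why no score normalization is needed for this measure.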
In the Applied Predictive Modeling book, in the discussion of the cost-sensitive learning approach, the authors write:
One consequence of this approach is that class probabilities cannot be
generated for the model, at least in the available implementation.
Therefore we cannot calculate an ROC curve and must use a different
performance metric. Instead we will now use the Kappa statistic,
sensitivity, and specificity to evaluate the impact of weighted
classes.
Can you explain to me why ROC/AUC cannot be used here, and why the Kappa statistic, sensitivity, and specificity are used instead? I thought sensitivity and specificity were also part of ROC/AUC?
Link for the book: https://cloudflare-ipfs.com/ipfs/bafykbzacedepga3g6t7b6rq6irwhy5gzpc47bamquhygup4eqggidvkjcztqs?filename=Max%20Kuhn%2C%20Kjell%20Johnson%20-%20Applied%20Predictive%20Modeling-Springer%20%282013%29.pdf
I'm working on knowledge graphs, more precisely in the natural language processing field. To evaluate the components of my algorithm, I need to be able to separate the good candidates from the poor ones. For this purpose, we manually classified pairs in a dataset.
My system returns the relevant pairs according to the implementation logic. Now I'm able to calculate:
Precision = X
Recall = Y
To establish a complete curve I need the rest of the points (X, Y). What should I do?
Build another dataset for testing?
Split my dataset?
Or is there another solution?
Neither of your two proposed methods. In short, a precision-recall or ROC curve is designed for classifiers with probabilistic output. That is, instead of simply producing a 0 or 1 (in the case of binary classification), you need a classifier that can provide a probability in the [0,1] range. In sklearn, the function for this is precision_recall_curve; note how its 2nd parameter is called probas_pred.
To turn these probabilities into concrete class predictions, you can then set a threshold, say at 0.5. Setting such a threshold is problematic, however, since you can trade off precision against recall by varying the threshold, and an arbitrary choice can give a false impression of a classifier's performance. To circumvent this, threshold-independent measures such as the area under the ROC or precision-recall curve are used. They create thresholds at different intervals, say 0.1, 0.2, 0.3, ..., 0.9, turn the probabilities into binary classes, and then compute precision and recall for each such threshold.
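The threshold sweep described above can be sketched in plain Python, without sklearn. The labels and probabilities below are made-up illustration data, the helper names are hypothetical, and the convention of reporting precision as 1.0 when nothing is predicted positive is an assumption of this sketch.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when probs >= threshold is predicted positive."""
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, predicted) if y and yhat)
    fp = sum(1 for y, yhat in zip(y_true, predicted) if not y and yhat)
    fn = sum(1 for y, yhat in zip(y_true, predicted) if y and not yhat)
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pr_curve(y_true, probs, thresholds):
    """One (threshold, precision, recall) triple per threshold."""
    return [(t,) + precision_recall_at(y_true, probs, t) for t in thresholds]


# Hypothetical manually labelled pairs and the classifier's probabilities.
y_true = [1, 1, 0, 1, 0, 0]
probs = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]

thresholds = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
curve = pr_curve(y_true, probs, thresholds)
```

Each triple in curve gives one (precision, recall) point, so a single labelled dataset yields the whole curve as long as the system exposes a score rather than only a yes/no decision.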
In a typical genetic algorithm, is there any guideline for estimating the generations required to converge given the amount of entropy in the description of an individual in the population?
I suppose it is also reasonable to require the number of offspring per generation and the mutation rate, but adjusting those parameters is of less interest to me at the moment.
Well, there are not any concrete guidelines in the form of mathematical models, but there are several concepts that people use to communicate about parameter settings and advice on how to choose them. One of these concepts is diversity, which would be similar to the entropy that you mentioned. The other concept is called selection pressure and determines the chance an individual has to be selected based on its relative fitness.
Diversity and selection pressure can be computed for each generation, but the change between generations is very difficult to estimate. You would also need models that predict the expected quality of your crossover and mutation operator in order to estimate the fitness distribution in the next generation.
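As a rough illustration of computing both quantities for a single generation, here is a sketch in Python. It assumes bit-string individuals, measures diversity as the mean per-position Shannon entropy, and measures selection pressure as best fitness divided by mean fitness (the relative selection chance of the best individual under fitness-proportional selection). Both definitions are one choice among several, and the population and the OneMax fitness are made up.

```python
import math


def bit_entropy(population):
    """Mean per-position Shannon entropy (in bits) of a bit-string population."""
    n = len(population)
    length = len(population[0])
    total = 0.0
    for i in range(length):
        p1 = sum(ind[i] for ind in population) / n  # share of ones at position i
        for p in (p1, 1.0 - p1):
            if p > 0.0:
                total -= p * math.log2(p)
    return total / length


def selection_pressure(fitnesses):
    """Best fitness over mean fitness (assumes positive fitness values)."""
    mean = sum(fitnesses) / len(fitnesses)
    return max(fitnesses) / mean


# Hypothetical generation of four 4-bit individuals.
population = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
fitnesses = [sum(ind) for ind in population]  # OneMax as a toy fitness

diversity = bit_entropy(population)
pressure = selection_pressure(fitnesses)
```

Tracking these two numbers over generations shows how quickly the population is collapsing, but it does not by itself predict the number of generations remaining.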
There has been work published on these topics recently:
* Chicano and Alba. 2011. Exact Computation of the Expectation Curves of the Bit-Flip Mutation using Landscapes Theory
* Chicano, Whitley, and Alba. 2012. Exact computation of the expectation curves for uniform crossover
Is your question the result of a general research interest, or are you seeking practical guidance?
No. If you define a mathematical model of the algorithm (initial population, combination function, mutation function) you can use normal mathematical methods to calculate what you want to know, but "typical genetic algorithm" is too vague to have any meaningful answer.
If you want to set the hyperparameters of some genetic algorithm (e.g. the number of "DNA" bits), then this is typically done in the usual way for any machine learning algorithm: with a cross-validation set.
I have some measurements on a system (x, y, z, ...) over many trials. The system produces a true or false output. I would like to take my data and produce a predictor function of x, y, z, ... that best predicts the system outcome.
I am used to methods for approximating smooth outcomes like approximating a graph, but don't know the terms to search for when the outcome is true/false.
Search for multivariate classification.
In your case you just have two classes (true and false).
The Wikipedia article on statistical classification has a list of commonly used algorithms.
You can also search for multivariate regression, which attempts to model a real value as a function of several variables; in your case the possible values form the discrete set {0, 1}.
One would have to take a decision on whether the predicted outcome is True or False based on the regression function's output (e.g. assume True if the output is > 0.5 and False if it's <= 0.5).
Note that there is also https://stats.stackexchange.com/ where you could get more detailed answers related to the analysis of data.
What you basically want is the probability of TRUE or FALSE. A standard technique is logistic regression. Logistic regression is a useful way of describing the relationship between a binary response variable and some independent variables. Since the output is a probability, it is easily interpretable.
There are standard libraries in most languages to implement logistic regression.
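For illustration only, here is a minimal logistic regression fitted by batch gradient descent in plain Python. In practice a standard library is the better choice. The trial data, the learning rate, and the number of epochs are assumptions of this sketch, and the 0.5 cut-off is just one possible decision rule.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that the outcome is True for measurement x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical trials: the outcome tends to be True when both inputs are large.
X = [[0.0, 0.2], [0.3, 0.1], [0.4, 0.4], [0.6, 0.7], [0.8, 0.6], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
predictions = [predict_proba(w, b, x) > 0.5 for x in X]
```

The predictor function asked for in the question is then predict_proba followed by the threshold comparison.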
A neural network seems like a perfect fit for your problem.
I'm using the ruby classifier gem whose classifications method returns the scores for a given string classified against the trained model.
Is the score a percentage? If so, is the maximum difference 100 points?
It's the logarithm of a probability. With a large trained set, the actual probabilities are very small numbers, so the logarithms are easier to compare. Theoretically, scores will range from infinitesimally close to zero down to negative infinity. 10**score * 100.0 will give you the actual probability, which indeed has a maximum difference of 100.
Actually, to calculate the probability from the score of a typical naive Bayes classifier, where b is the base of the logarithm, use b^score/(1+b^score). This is the inverse logit (http://en.wikipedia.org/wiki/Logit). However, given the independence assumptions of the NBC, these scores tend to be too high or too low, and probabilities calculated this way will accumulate at the boundaries. It is better to calculate the scores on a holdout set and do a logistic regression of accuracy (1 or 0) on score to get a better feel for the relationship between score and probability.
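The inverse logit formula above can be written as a small Python helper. This is a sketch of the formula only, not of the classifier gem's internals; the function name and the example scores are made up, and the base of the logarithm is left as a parameter because it depends on the implementation.

```python
import math


def inverse_logit(score, base=math.e):
    """Map a log-odds score in the given base to b**s / (1 + b**s)."""
    z = score * math.log(base)  # convert the score to natural-log units
    # Branch on the sign so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Scores of large magnitude are pushed to the boundaries 0 and 1.
examples = [inverse_logit(s) for s in (-50.0, -1.0, 0.0, 1.0, 50.0)]
```

The first and last values in examples are indistinguishable from 0 and 1 for practical purposes, which is the accumulation at the boundaries mentioned above.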
From a Jason Rennie paper:
2.7 Naive Bayes Outputs Are Often Overconfident
Text databases frequently have
10,000 to 100,000 distinct vocabulary words; documents often contain 100 or more
terms. Hence, there is great opportunity for duplication.
To get a sense of how much duplication there is, we trained a MAP Naive Bayes
model with 80% of the 20 Newsgroups documents. We produced p(c|d;D) (posterior)
values on the remaining 20% of the data and show statistics on max_c p(c|d;D) in
table 2.3. The values are highly overconfident. 60% of the test documents are assigned
a posterior of 1 when rounded to 9 decimal digits. Unlike logistic regression, Naive
Bayes is not optimized to produce reasonable probability values. Logistic regression
performs joint optimization of the linear coefficients, converging to the appropriate
probability values with sufficient training data. Naive Bayes optimizes the coefficients
one-by-one. It produces realistic outputs only when the independence assumption
holds true. When the features include significant duplicate information (as is usually
the case with text), the posteriors provided by Naive Bayes are highly overconfident.