Source:- https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/
AdaBoost can be used to boost the performance of any machine learning
algorithm. It is best used with weak learners. These are models that
achieve accuracy just above random chance on a classification problem.
I did not understand what the highlighted( bold and italic ) part of the above text is trying to say. Can someone kindly explain it?
Consider a two-class problem, performance based on chance alone is 0.5 (1/2). So, you need to select a weak classifier that is right greater than or equal to half the times.
Let us say you have some classifier that can give you a performance of 0.51. You follow the steps as in the article you have read already, and with the addition of each weak classifier, the performance improves.
The reason why they mention it is best used with weak learners is that you get the highest 'benefit' from that, in terms of computational complexity and performance tradeoff from a practical view point. If you already had a classifier that was say 0.9 accuracy, then, the gain out of boosting would not be as much as starting with a classifier that had say 0.51.
Related
I'm hoping someone with a lot more knowledge of machine learning can help me out here. I've been reading examples of regression and classification and I always seem to come back to the question 'what is really the difference between what this algorithm is doing and what standard statistical analysis would do'.
Specifically, none of the examples I read seem to discuss the predictive element. For example, when looking at linear regression the articles commonly explain the concept of trying to create a 'best fit' - the combination of a linear equation and then iterating a cost function until it reaches a minimum. Of course, throughout a lot of emphasis is put on a 'training data set'. No problem... but this is usually where it ends. At this point I can't see the difference between the above and the standard way in which one would carry out statistical analysis on a data set that was assumed to have a linear relationship. Presumably, future values here are 'predicted' from the equation that was produced when the cost function converged on a minimum - again, there doesn't seem to be much 'learning' here as this is exactly what would be done in the usual case.
After a long winded intro... what I'm trying to ask is how has the algorithm learned from the original training data? and how does this training set help with future data sets? (again, this is where I get a bit lost - to me it seems that you would give it a new data set and carry out the same task of minimising the cost function - however, this time you have a better 'starting' point but all of your knowledge really comes from what you already 'knew' about the dataset i.e that one assumed a linear relationship).
I hope this makes sense - it's clearly a lack of understanding, but I'm hoping someone can shove me in the right direction.
Thanks!
You are right, there is no difference. Linear regression is purely a statistical method, and "fitting" would probably be more accurate than "learning" in this case. But again, this is usually just the first lecture on the subject. There many approaches where the differences are much clearer, for example SVMs. There are also approaches where the "learning" aspect is much clearer, eg using reirforcement learning in games, where you can actually see your system improve its performance with experience.
Anyway, the main subject of machine learning is learning from examples. You are given a list of 100 patients, along with blood pressure, age, cholesterol level etc, and for each of them you are told whether they have heart disease or not. Then, you are given a patient that you had not seen before. Does he have heart disease?? Most people call this prediction. You might prefer to call it fitting, or anything else. But the fact is, it usually works quite well.
Still, the subject remains closely tied to statistics, and indeed, you need to make some assumptions (to a larger or smaller extent, depending on the algorithm) about the underlying function. It is not perfect, but in many cases it's the best thing we have, so I would say it is worth studying. If you are starting now, there is a great online course, Stanford's "Statistical Learning", which deals with the subject from your point of view.
How do you find an optimum learning rule for a given problem, say a multiple category classification?
I was thinking of using Genetic Algorithms, but I know there are issues surrounding performance. I am looking for real world examples where you have not used the textbook learning rules, and how you found those learning rules.
Nice question BTW.
classification algorithms can be classified using many Characteristics like:
What does the algorithm strongly prefer (or what type of data that is most suitable for this algorithm).
training overhead. (does it take a lot of time to be trained)
When is it effective. ( large data - medium data - small amount of data ).
the complexity of analyses it can deliver.
Therefore, for your problem classifying multiple categories I will use Online Logistic Regression (FROM SGD) because it's perfect with small to medium data size (less than tens of millions of training examples) and it's really fast.
Another Example:
let's say that you have to classify a large amount of text data. then Naive Bayes is your baby. because it strongly prefers text analysis. even that SVM and SGD are faster, and as I experienced easier to train. but these rules "SVM and SGD" can be applied when the data size is considered as medium or small and not large.
In general any data mining person will ask him self the four afomentioned points when he wants to start any ML or Simple mining project.
After that you have to measure its AUC, or any relevant, to see what have you done. because you might use more than just one classifier in one project. or sometimes when you think that you have found your perfect classifier, the results appear to be not good using some measurement techniques. so you'll start to check your questions again to find where you went wrong.
Hope that I helped.
When you input a vector x to the net, the net will give an output depend on all the weights (vector w). There would be an error between the output and the true answer. The average error (e) is a function of the w, let's say e = F(w). Suppose you have one-layer-two-dimension network, then the image of F may look like this:
When we talk about training, we are actually talking about finding the w which makes the minimal e. In another word, we are searching the minimum of a function. To train is to search.
So, you question is how to choose the method to search. My suggestion would be: It depends on how the surface of F(w) looks like. The wavier it is, the more randomized method should be used, because the simple method based on gradient descending would have bigger chance to guide you trapped by a local minimum - so you lose the chance to find the global minimum. On the another side, if the suface of F(w) looks like a big pit, then forget the genetic algorithm. A simple back propagation or anything based on gradient descending would be very good in this case.
You may ask that how can I know how the surface look like? That's a skill of experience. Or you might want to randomly sample some w, and calculate F(w) to get an intuitive view of the surface.
I am implementing my M.Sc dissertation and in theory aspect of my thesis, i have a big problem.
suppose we want to use genetic algorithms.
we have 2 kind of functions :
a) some functions that have relations like this : ||x1 - x2||>>||f(x1) - f(x2)||
for example : y=(1/10)x^2
b) some functions that have relations like this : ||x1 - x2||<<||f(x1) - f(x2)||
for example : y=x^2
my question is that which of the above kind of functions have more difficulties than other when we want to use genetic algorithms to find optimum ( never mind MINIMUM or MAXIMUM ).
Thank you a lot,
Armin
I don't believe you can answer this question in general without imposing additional constraints.
It's going to depend on the particular type of genetic algorithm you're dealing with. If you use fitness proportional (roulette-wheel) selection, then altering the range of fitness values can matter a great deal. With tournament selection or rank-biased selection, as long as the ordering relations hold between individuals, there will be no effects.
Even if you can say that it does matter, it's still going to be difficult to say which version is harder for the GA. The main effect will be on selection pressure, which causes the algorithm to converge more or less quickly. Is that good or bad? It depends. For a function like f(x)=x^2, converging as fast as possible is probably great, because there's only one optimum, so find it as soon as possible. For a more complex function, slower convergence can be required to find good solutions. So for any given function, scaling and/or translating the fitness values may or may not make a difference, and if it does, the difference may or may not be helpful.
There's probably also a No Free Lunch argument that no single best choice exists over all problems and optimization algorithms.
I'd be happy to be corrected, but I don't believe you can say one way or the other without specifying much more precisely exactly what class of algorithms and problems you're focusing on.
I tried naive bayes classifier and it's working very bad. SVM works a little better but still horrible. Most of the papers which i read about SVM and naive bayes with some variations(n-gram, POS etc) but all of them gives results close to 50% (authors of articles talk about 80% and high but i cannt to get same accurate on real data).
Is there any more powerfull methods except lexixal analys? SVM and Bayes suppose that words independet. These approach called "bag of words". What if we suppose that words are associated?
For example: Use apriory algorithm to detect that if sentences contains "bad and horrible" then 70% probality that sentence is negative. Also we can use distance between words and so on.
Is it good idea or i'm inventing bicycle?
You're confusing a couple of concepts here. Neither Naive Bayes nor SVMs are tied to the bag of words approach. Neither SVMs nor the BOW approach have an independence assumption between terms.
Here's some things you can try:
include punctuation marks in your bags of words; esp. ! and ? can be helpful for sentiment analysis, while many feature extractors geared toward document classification throw them away
same for stop words: words like "I" and "my" may be indicative of subjective text
build a two-stage classifier; first determine whether any opinion is expressed, then whether it's positive or negative
try a quadratic kernel SVM instead of a linear one to capture interactions between features.
Algorithms like SVM, Naive Bayes and maximum entropy ones are supervised machine learning algorithms and the output of your program depends on the training set you have provided.
For large scale sentiment analysis I prefer using unsupervised learning method in which one can determine the sentiments of the adjectives by clustering documents into same-oriented parts, and label the clusters positive or negative. More information can be found out from this paper.
http://icwsm.org/papers/3--Godbole-Srinivasaiah-Skiena.pdf
Hope this helps you in your work :)
You can find some useful material on Sentimnetal analysis using python.
This presentation summarizes Sentiment Analysis as 3 simple steps
Labeling data
Preprocessing &
Model Learning
Sentiment Analysis is an area of ongoing research. And there is a lot of research going on right now. For an overview of the most recent, most successful approaches, I would generally advice you to have a look at the shared tasks of SemEval. Usually, every year they run a competition on Sentiment Analysis in Twitter. You can find the paper describing the task, and the results for 2016 here (might be a bit technical though): http://alt.qcri.org/semeval2016/task4/data/uploads/semeval2016_task4_report.pdf
Starting from there, you can have a look in the papers describing the individual systems (as referenced there).
I am a non technical person, who is trying to implement image classification. In this paper, I came across the ADA Boost algorithm, which was implemented after the 'bag of features' step for video keyframes. Can someone explain in layman terms what the ADA Boost does, and what is its input and output? Can someone point me out to code for the same?
First off, it would be nice to if you could link/name the paper you are referring to.
AdaBoost is a meta classification algorithm, as it combines multiple classifiers called weak learners. These weak learners are often really simple, e.g. they only classify the data based on one feature, and perform slightly better than random.
In image classification, AdaBoost will use as input a data set of images (with corresponding labels depicting to which class each sample belongs) and a set of weak learners. AdaBoost will then find the weak learner with the lowest error rate (i.e. best results) on the data. All correctly classified data samples are now given a lower weight as they are now less important, while the miss-classified samples are given a higher weight. AdaBoost will now start a new round and selects the best weak learner based on the newly weighted data. In other words, it will find a new weak learner which is better at classifying the samples which the previously selected weak learners were not able to classify.
The algorithm will continue with selecting these weak learners for a specified amount of iterations. The output consists of the group of selected weak learners. The learned classifier can now classify new images based on a majority vote of each weak classifier in the group (often the weak classifiers themselves are also weighted based on their achieved error rate).
You might want to take a look at software that have already implemented AdaBoost, like WEKA or the the computer vision orientated OpenCV.
Adaboost takes a bunch of weak classifiers and combines them to form a strong classifier. The outputs are a sequence of weights w_i for the weak classifiers used in the summand to form a single weighted classifier. There are many intermediate outputs from the algorithm, but maybe the most important is the weights themselves.
Although it wasn't originally conceived that way, Adaboost is equivalent to fitting a "forward stagewise" model on the training set, using the weak classifiers at each step using an exponential loss function: L(y,f(x)) = exp(-y*f(x)), where f(.) is our classifier. Viewed this way, some aspects of the algorithm are clearer. The exponential loss function is often used for classification problems, for good reason.