I'm trying to understand how random forest works in plain English instead of mathematics. Can anybody give me a really simple explanation of how this algorithm works?
As far as I understand, we feed in the features and labels without telling the algorithm which feature should be classified as which label? When I used Naive Bayes, which is based on probability, we needed to tell it which feature belongs to which label. Am I completely off?
I'd really appreciate any very simple explanation.
Random forest uses a so-called bagging approach. The idea rests on the classic bias-variance trade-off. Suppose we have a set of N overfitted estimators, each with low bias but high variance across samples. Low bias is good and we want to keep it; high variance is bad and we want to reduce it. Random forest tries to achieve this through bootstrapping/sub-sampling (as @Alexander mentioned, this combines bootstrap sampling of the observations with random sub-sampling of the features). The prediction is the average of the individual estimators, so the low-bias property is preserved. Moreover, if the estimators were independent, the variance of this average would equal the variance of an individual estimator divided by N (so its standard deviation shrinks by a factor of the square root of N). In practice the trees are correlated, so the reduction is smaller, which is why the random feature selection that decorrelates them matters. The result has both low bias and lower variance, and this is why a random forest often outperforms a stand-alone estimator.
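To see the variance argument numerically, here is a minimal simulation sketch. It assumes idealized, fully independent estimators whose errors are unit-variance Gaussian noise; real trees in a forest are correlated, so the reduction is smaller. The function name and parameters are made up for illustration.

```python
import random
import statistics

def variance_of_average(n_estimators, trials=20000, seed=0):
    """Empirical variance of the mean of n independent, unit-variance noisy estimates."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        # Each "estimator" is the true value (0) plus independent Gaussian noise.
        estimates = [rng.gauss(0.0, 1.0) for _ in range(n_estimators)]
        averages.append(statistics.fmean(estimates))
    return statistics.pvariance(averages)

single = variance_of_average(1)    # close to 1
bagged = variance_of_average(25)   # close to 1/25, i.e. divided by N, not by sqrt(N)
print(f"single estimator: {single:.3f}, average of 25: {bagged:.3f}")
```

Averaging 25 independent estimators cuts the variance by roughly a factor of 25, which corresponds to the standard deviation shrinking by a factor of 5.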
Adding on to the two answers above: since you asked for a simple explanation, here is a write-up that I feel is the simplest way to explain random forests.
Credit goes to Edwin Chen for this explanation of random forests in layman's terms. I'm posting it below.
Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.
Thus, Willow is a decision tree for your movie preferences.
But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).
Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.
By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.
There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardo DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.
And so your friends now form a random forest.
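The two kinds of randomness in the story can be sketched in a few lines. This is only an illustration under my own assumptions: the movie ratings, the question names, and the helper functions are invented, and a real forest repeats the question draw at every node of every tree.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: some rows repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]

def questions_for_next_node(questions, k, rng):
    """Pick a random subset of k questions a friend may ask at one node."""
    return rng.sample(questions, k)

rng = random.Random(42)
ratings = [("Black Swan", "like"), ("Harry Potter", "like"),
           ("Avatar", "dislike"), ("Titanic", "like")]
questions = ["is_romantic", "stars_johnny_depp", "stars_leo", "is_pixar"]

friends = {}
for name in ["Willow", "Woody", "Apple", "Cartman"]:
    friends[name] = {
        # Randomness at the data level: each friend sees a perturbed training set.
        "data": bootstrap_sample(ratings, rng),
        # Randomness at the model level: only some questions are allowed.
        "questions": questions_for_next_node(questions, 2, rng),
    }

for name, info in friends.items():
    print(name, info["data"], info["questions"])
```

A friend who sees "Black Swan" twice and never sees "Harry Potter" corresponds exactly to the Woody example above.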
I will try to give another complementary explanation with simple words.
A random forest is a collection of random decision trees (of number n_estimators in sklearn).
What you need to understand is how to build one random decision tree.
Roughly speaking, to build a random decision tree you start from a subset of your training samples. At each node you will draw randomly a subset of features (number determined by max_features in sklearn). For each of these features you will test different thresholds and see how they split your samples according to a given criterion (generally entropy or gini, criterion parameter in sklearn). Then you will keep the feature and its threshold that best split your data and record it in the node.
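The node-building step can be sketched as follows. This is a simplified illustration, not sklearn's actual implementation: it assumes numeric features, uses Gini impurity, and tries midpoints between consecutive distinct values as thresholds. The function names are my own.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y, max_features, rng):
    """Return (weighted_gini, feature_index, threshold) of the best split found
    among a random subset of max_features features, or None if no split exists."""
    candidates = rng.sample(range(len(X[0])), max_features)
    best = None
    for f in candidates:
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y = ["a", "a", "b", "b"]
print(best_split(X, y, max_features=2, rng=random.Random(0)))
```

With max_features smaller than the number of features, some nodes will not even get to consider the most informative feature, which is what makes the trees differ from one another.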
When the construction of the tree ends (it can be for different reasons: maximum depth is reached (max_depth in sklearn), minimum sample number is reached (min_samples_leaf in sklearn) etc.) you look at the samples in each leaf and keep the frequency of the labels.
As a result, it is like the tree gives you a partition of your training samples according to meaningful features.
Since each node is built from randomly drawn features, each tree built this way will be different. This contributes to the good compromise between bias and variance, as explained by @Jianxun Li.
Then in testing mode, a test sample will go through each tree, giving you label frequencies for each tree. The most represented label is generally the final classification result.
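One way to combine the per-tree label frequencies is to normalize each tree's leaf counts and add them up, then pick the label with the largest total. A minimal sketch, with made-up leaf counts and a hypothetical function name:

```python
from collections import Counter

def forest_predict(per_tree_frequencies):
    """Combine leaf label counts from several trees into one predicted label.

    per_tree_frequencies: one dict per tree, mapping label -> number of
    training samples with that label in the leaf the test sample reached.
    """
    totals = Counter()
    for freqs in per_tree_frequencies:
        n = sum(freqs.values())
        for label, count in freqs.items():
            totals[label] += count / n  # each tree contributes a probability
    return totals.most_common(1)[0][0]

leaves = [
    {"like": 8, "dislike": 2},
    {"like": 3, "dislike": 7},
    {"like": 6, "dislike": 4},
]
print(forest_predict(leaves))
```

Here two of the three trees lean towards "like", and so does the averaged result. A plain majority vote over each tree's top label is the other common variant.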
This is a long shot, but I thought I might try before starting the dirty work.
I've got a project to build an application that takes a defined set of stations (vertices) and lines (edges), that is, a real map of some public transportation network, and schematizes it into a metro map. I've done some research on the problem, and it's NP-complete, equivalent to the 3-SAT problem. I also have some theoretical ideas on how to generate such a map, but they aren't detailed enough.
What I'm looking for is any other existing solution of this problem, some sort of pseudo-code, some real code in (almost) any other programming language etc, anything that would reduce the time I need to spend working on the algorithm itself, which will in return give me more time to work on other aspects of the application.
If anyone has ever seen anything that can help me, I'd appreciate it very much.
If you google for "metro map layout problem" and "metro map line crossing" you'll find a lot of references, since it has been researched very actively in the past 10 years.
The problem is not trivial at all, and translating the "artistic" features into mathematical constraints seems to be one of the most difficult tasks.
Anyway here are three publications that I found interesting to start with (among many, many others):
Metro Map Layout Using Multicriteria Optimization
Line Crossing Minimization on Metro Maps
The Metro Map Layout Problem
HTH!
Research that's similar to your topic: http://graphics.stanford.edu/papers/routemaps/
This is just some suggestion with handwaving - take with a pinch of salt.
My notion of a "metro" map is one where lines tend to follow one of the eight compass directions (horizontal, vertical, and the two diagonals) and stations are regularly spaced.
I'm assuming you're trying to convert a set of real coordinates into "metro" coordinates.
I would start with your main route (e.g., a city loop), then incrementally add other routes in order of importance.
For each route you want to find the nearest approximation that uses the fewest straight lines travelling in the eight compass directions. You might do this by starting with the bounding box for the real coordinates, splitting that into a grid, then finding a "metro" route from grid square to grid square, then successively refining that route to reduce the number of bends without distorting the map too much and without introducing crossings with other routes if at all possible.
Having done that, scale each line so that consecutive stations are the same distance apart on the "metro" view.
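Two small building blocks for this kind of approach are snapping a segment's direction to the nearest multiple of 45 degrees and snapping a coordinate to a grid. The sketch below shows only those two helpers, with names I made up; it does not do the route refinement or crossing avoidance described above.

```python
import math

def snap_angle(dx, dy):
    """Nearest of the eight 45-degree directions, in degrees (0 = east, 90 = north)."""
    angle = math.degrees(math.atan2(dy, dx))
    return (round(angle / 45.0) * 45) % 360

def snap_to_grid(x, y, cell):
    """Move a real coordinate to the nearest grid point with spacing `cell`."""
    return (round(x / cell) * cell, round(y / cell) * cell)

print(snap_angle(1.0, 0.1))         # an almost-horizontal segment
print(snap_angle(1.0, 0.9))         # an almost-diagonal segment
print(snap_to_grid(12.4, 7.6, 5))   # station placed on a 5-unit grid
```

Snapping each segment independently can create collisions or crossings, so in practice the result would still need the refinement pass and probably manual tweaking.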
My guess is you'll still want to support manual tweaking of the result.
Good luck!
Feels like a planning problem.
Looks like your hard constraints are:
Every station must be on a point. Points are on a grid with a distance of X between points (I'd fix this at 2 cm).
There should not be 2 stations on the same spot
There should be enough room to draw the station label. Note that the label can be assigned different directions from the point to which the station is assigned.
There should be enough room to draw the subway lines.
Looks like your soft constraints are:
For each station, minimize the distance between its actual geographical location and the point assigned to it.
Then throw something like Drools Planner on it, here's an example of hard and soft constraints for nurse rostering.
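To make the hard/soft split concrete, here is a toy scoring function in Python. It is not Drools Planner code, and it models only one hard constraint (no two stations on the same point) and the one soft constraint (stay close to the real location); the label and line spacing constraints are left out. All names and coordinates are made up.

```python
import math

def score(assignment, real_positions):
    """Return (hard, soft); higher is better and tuples compare hard-first.

    assignment: station -> (col, row) grid point
    real_positions: station -> (x, y) real location in grid units
    """
    points = list(assignment.values())
    # Hard: every extra station sharing a grid point counts as one violation.
    hard = -(len(points) - len(set(points)))
    # Soft: total distance between assigned points and real locations.
    soft = -sum(math.dist(assignment[s], real_positions[s]) for s in assignment)
    return (hard, soft)

real = {"A": (0.2, 0.1), "B": (0.4, 0.1)}
valid = {"A": (0, 0), "B": (1, 0)}
colliding = {"A": (0, 0), "B": (0, 0)}

best = max([valid, colliding], key=lambda a: score(a, real))
print(score(valid, real), score(colliding, real), best)
```

The colliding layout is closer to the real positions, but it loses because any hard violation outweighs every soft gain. A planner then searches over assignments for the best score.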
My friend works for a non-profit organization working to stop the illegal exploitation of minors over sites such as craigslist.org, which is one of the more popular mediums. The question is whether or not it is possible, now or in the near future, to develop an algorithm to analyze a photo of a person and return a prediction of their relative age.
It sounds like a mammoth task. My only thought was some sort of Bayesian probability system. I know even people often have trouble judging someone's age but Bayesian spam filters are advertised as being "10 times as accurate as a human" so maybe it's possible?
I am pretty inexperienced though. I would appreciate it if someone else could suggest whether or not this is feasible and if so how and when?
EDIT: Thank you everyone for the responses. Smoore that study was very helpful but I think Hal's solution is the most practical for the time being.
Here's a possible (left-field) solution. Perhaps, you could tie it into some type of a captcha solution for the site itself. Prompt new users with images of other new users with the question: "Is this person over 18?". It's true that a 50% success rate is not a very effective captcha system, but it's a start.
Coupled with some other checks or repeated checks, it could work. You could display the image to a number of new users and base the result on a certain threshold. If 8 out of 10 people flag a certain image as not a minor, then it's probably pretty safe to assume the person is of age.
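The threshold rule is simple enough to sketch. The 0.8 threshold and the minimum of 10 votes come from the "8 out of 10" example; the function name and the idea of returning None while votes are still being collected are my own assumptions.

```python
def is_probably_adult(votes, threshold=0.8, min_votes=10):
    """votes: list of booleans, True meaning a reviewer answered 'over 18'.

    Returns None until enough votes are in, otherwise whether the share
    of 'over 18' answers reaches the threshold.
    """
    if len(votes) < min_votes:
        return None
    return sum(votes) / len(votes) >= threshold

print(is_probably_adult([True] * 8 + [False] * 2))
print(is_probably_adult([True] * 7 + [False] * 3))
print(is_probably_adult([True] * 3))
```

Images that fall below the threshold would presumably go to manual review rather than being rejected automatically.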
But, this whole system can be circumvented by simply uploading someone else's image so I'm not sure how effective any of this really is. :)
I expect it would be pretty hard to get right. Consider this set of photos where the same model is made up to look very different ages.
There are algorithms that reliably determine the attractiveness of a face. See acm.org and uni-regensburg.de. It wouldn't be too much of a stretch to imagine an algorithm that could predict age.
Characteristics such as smoothness would probably have a strong correlation with age. It would probably take a great deal of effort to be more reliable than your average carney though.
I think you would need some input from a forensic anthropologist (or at least an anatomist).
Different parts of the body grow at different rates, so it might be possible to use ratios such as head size vs. shoulder width or arm length vs. body width.
Unfortunately, it sounds like he is trying to differentiate between, say, 14-year-olds and 18-year-olds, which is only a four-year difference. Variations in genetic makeup and nutrition would probably give any system an accuracy of +/- 20%, which would equate to about three years for this age group.
On the other hand, if you had a large sample of photos, you could account for the variance statistically and get a pretty good idea of whether a site was likely to be exploiting minors systematically.
The direct answer to your question is no: no such algorithm will exist in the near future, and it is probably impossible to achieve with any accuracy without strong AI.
That said, a practical solution to your problem is probably the amazon mechanical turk:
http://mturk.com
There, you can pay a small fee to have real people complete a task for you. I'd probably set your task up so that you paid $0.02 to have a person estimate the age of maybe 5 faces at a time. You could double or triple check your results with other workers, particularly for those faces who seemed close to your age limit. This is probably your only practical solution other than hiring minimum wage interns to manually review all submissions.
Use mechanical turk
In this study they tried it by analysing facial geometry and wrinkle features. Problem is this would be affected by shot angle, lighting, etc.
In some theoretical sense it is probably possible. For all practical purposes though, it is currently impossible.
Mammoth is an understatement I think. "Giant glacier" or "moon" might be more appropriate.
This isn't to say it wouldn't be worth looking into but I have a feeling you'd be in for a lot of man hours before you came up with something remotely useful.
I don't think it's something that a computer could do with any degree of accuracy. It's really hard even for people. I mean, have you been to the liquor store lately? They are supposed to ask for ID from anybody who looks under 25 (drinking age is 19 here). Apparently some 40-year-olds don't look old enough. Telling somebody's age just by looking at them is a very hard thing to do, especially when you get into the erotic picture arena, where they are trying to make models seem younger than they really are.
I think you will also have difficulties with differently composed pictures: for instance, angles on a face, different lighting, context, and probably most of all image quality/resolution. It's a lot easier to work with an 800x600 picture than with a 320x240 one. The algorithm is only as good as its input.
I cannot see this approach (a software solution to measuring age) being very effective. I like the idea of users flagging images: a human being can discern age many times more effectively than any algorithm.
Practical approach aside, I'd advise against trying to develop anything in that direction for now.
Few reasons:
1. guessing someone's age is not a grateful task
2. "biological" age and "calendar" age of people vary greatly - I know people who are 30 and are still asked for an ID when buying liquor, and some who are barely 18 and already look over 30
3. some people's looks don't change over time - they just have that kind of looks
4. nowadays, everyone's working to look as young as they can - so basically, you've got the whole industry working against you :(
Anyways, to cut long story short, I don't think it's feasible for now.
A neural net is a reasonable approach. You would need a training set of pictures of people with known ages and a bit of image processing to remove hats etc.
edit: Question changed?
You might be able to classify someone as 20-30 or 40-50 on CCTV, but you aren't going to be able to tell whether a model is 17 or 18 in a posed photo.
Just like nearly all advanced tasks in image classification this topic is still in research. Judging from this paper it is possible to do it but non-trivial, also you have to have a lot of (manually) annotated training data. Without any knowledge of this field and no experience in image processing this task is going to take you several months.
Develop a classification algorithm that bases a heuristic on many values from the picture, such as the number of dark pixels within the face area (possibly wrinkles) and the color of the hair. These values should fall within a general range for any profile-style picture. If you want to be fancy, attach weights to these values and develop a type of game tree that can search hundreds of thousands of images quickly, finding where an image "falls" in the tree within an age-specific set of values.
Some Japanese cigarette vending machines do this. Not terribly well by all accounts, but then it probably doesn't matter since, as Hal mentioned, the easiest hack is just to use someone else's image...
Nothing is impossible; only the amount of effort changes.
I think it would be nearly impossible if you targeted just one particular feature of the face.
You have to consider multiple factors, so the decision will lie in a matrix of features that you feed in to get your answer. I would list some features:
1) Beard (detect the face, then detect a beard on it; helpful in distinguishing male/female/children)
2) Hair
3) Wrinkles
4) Size of face
5) Ratio between height and breadth of face
It would be a tough assignment, but an algorithm can be developed.
As of now, this is possible with 90% accuracy. Please refer to the following link:
http://www.omron.com/r_d/coretech/vision/okao.html
I'm building, for example, an application that monitors your health. Each day, you're jogging and doing push-ups and you enter the information on a web site.
What I would like to do is build a chart combining the hours you jogged and the number of push-ups/sit-ups you did. Let's say on the first day you jogged 1 hour and did 10 push-ups, and on the second day you jogged 50 minutes and did 20 push-ups; you would see a progression in your training.
I know it may sound strange, but I want to have an overall view of your health, not different views for jogging and push-ups. I don't want a double y-axis chart because if I have, for example, 6 runners, I will end up with 12 lines on the chart.
First I would redefine your terms. You are not tracking "health" here, you are tracking level of exertion through exercise.
Max Exertion != Max Health. If you exert yourself to the max and don't eat or drink, you will actually damage your health. :-)
To combine and plot your total "level of exertion" for multiple exercises you need to convert them to a common unit ... something like "calories burned".
I'm pretty sure there are many sources for reference tables with rough conversion factors for how many calories various exercises burn.
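Once you have conversion factors, combining the exercises is a weighted sum. In the sketch below the factors are placeholders I invented purely to show the arithmetic; they are not real physiological data and should be replaced with values from a proper reference table.

```python
# PLACEHOLDER factors for illustration only, not real calorie data.
KCAL_PER_UNIT = {"jogging_minutes": 10.0, "push_ups": 0.5}

def total_exertion(day, factors=KCAL_PER_UNIT):
    """Convert each exercise to the common unit and add them up."""
    return sum(factors[name] * amount for name, amount in day.items())

day1 = {"jogging_minutes": 60, "push_ups": 10}
day2 = {"jogging_minutes": 50, "push_ups": 20}
print(total_exertion(day1), total_exertion(day2))
```

Each user then contributes a single line to the chart, so 6 runners give 6 lines instead of 12.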
Does that help any?
Then you need a model of how push-ups and jogging affect yourself, and for this you should be asking a doctor or fitness expert, not a programmer :-). This question should probably be taken elsewhere.
Sounds like a double y-axis chart.
You can just do a regular Excel-type chart with 2 lines, scaled appropriately: one for push-ups, one for jogging time. There are graphics libraries that let you do that in the back-end language of your choice. The x-axis is the date.
You may want to have 2 scaled graphs, one for last week and one for last year (ala Yahoo Finance charts for different intervals).
Show the first set of values as a line graph above the x axis, and the second set below the x axis. If both sets of values increase over time this will show as an "expansion" of the graph; should be easy to recognize if one set is growing but the other is not.
Because the two quantities have no intrinsic relationship, you're stuck with either displaying them independently, such as two curves with two y-axes, or making up a measure that combines them, such as an estimate of calories burned, muscles used, mental anguish from exercising, etc. But it's tricky... taking from your example, I suspect one will never approach the calories burned from a 50 mile run by doing push-ups. Combining these in a meaningful way depends not on mathematics but on approximations and knowledge of the quantities that you start with and are interested in.
One compromise might be a graph with a single y-axis that shows some combined quantity, but where the independent values at each point are also graphically represented, for example, by a line where the local color represents the ratio of miles to pushups, or any of the many variants that display information in the shapes or colors in the plot.
Another option is to do a 3D plot, and then rotate it around and look for trends or whatever interests you.
If you want one overall measure of exercise levels, you could try using total exercise time. Another alternative is to define a points system, whereby users score points for each exercise.
I do think that there is virtue in letting the users see how much of each individual exercise they have done - in this case use a different graph for different exercises rather than using dual y-axes, if the scales are not comparable (e.g. time jogging and number of push-ups). There is a very good article on the problems with dual y-axes by business intelligence guru Stephen Few, here (pdf).
If you want to know more about presenting data well, I can also recommend his book "Now you see it", and the classic "The Visual Display of Quantitative Information" by Edward Tufte.
I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start, though.
Here's what I want to accomplish:
Compose a set of samples from each users likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, comparing sets from user A similar to sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution already, so... any suggestions?
Actually, after a quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" movie data, training a classifier for each user, and then just classify each movie?
If your suggestion includes some brain melting concepts (I'm not experienced in these subjects, specially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.
Thanks!
Matthew Podwysocki had some interesting articles on this stuff
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/03/30/functional-programming-and-collective-intelligence.aspx
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/04/01/functional-programming-and-collective-intelligence-ii.aspx
http://weblogs.asp.net/podwysocki/archive/2009/04/07/functional-programming-and-collective-intelligence-iii.aspx
This is similar to this question, where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (a 1-5 star rating, for example) and a set of attributes for each movie (year, genre, actors, ...). We want to build a recommender that outputs a predicted rating for unseen movies. So the input data looks like:
user  movie  year  genre  ... | rating
---------------------------------------------
1     1      2006  action     | 5
3     2      2008  drama      | 3.5
...
and for an unrated movie X:
10    20     2009  drama      | ?
we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very simple and easy to implement from scratch.
Other, more sophisticated approaches exist. For example, you can build a decision tree or fit a set of rules on the training data. You can also use Bayesian networks, artificial neural networks, support vector machines, among many others... Going through each of these won't be easy for someone without the proper background.
Still, I expect you would be using an external tool/library. Since you seem to be familiar with Bayesian networks, a simple naive Bayes net could in fact be very powerful. One advantage is that it allows for prediction with missing data.
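A from-scratch nearest-neighbor predictor can be as short as the sketch below. It makes several simplifying assumptions of mine: it uses one user's ratings only, encodes each movie as (year, is_action, is_drama), measures closeness with plain Euclidean distance, and averages the k closest ratings. The data is invented for the example.

```python
import math

def predict_rating(target, rated, k=3):
    """Average the ratings of the k rated movies closest to `target`.

    rated: list of (feature_vector, rating) pairs for one user.
    """
    nearest = sorted(rated, key=lambda item: math.dist(item[0], target))[:k]
    return sum(rating for _, rating in nearest) / len(nearest)

# Features: (year, is_action, is_drama)
rated = [
    ((2006, 1, 0), 5.0),
    ((2008, 0, 1), 3.5),
    ((2009, 0, 1), 4.0),
    ((1995, 1, 0), 2.0),
]
movie_x = (2009, 0, 1)
print(predict_rating(movie_x, rated, k=2))
```

Note that the raw year dominates the distance here because its scale is much larger than the 0/1 genre flags, so in practice the features should be scaled or weighted first.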
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
If you want to play around with different algorithms in simple intuitive package which requires no programming, I suggest you take a look at Weka (my 1st choice), Orange, or RapidMiner. The most difficult part would be to prepare the dataset to the required format. The rest is as easy as choosing what algorithm and applying it (all in a few clicks!)
For someone not looking to go into too much detail, I would recommend the nearest neighbor method, as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.
There are a few algorithms that are good for this:
ARTMAP: groups items by their probability of belonging together (this isn't fast, but it's the best thing for your problem IMO)
ARTMAP holds a group of common attributes and determines the likelihood of similarity as a percentage.
ARTMAP
KMeans: This separates the vectors by their distance from each other
KMeans: Wikipedia
PCA: will separate the average of all the values from the varying bits. This is what you would use for face detection and background subtraction in computer vision.
PCA
The K-nearest neighbor algorithm may be right up your alley.
Check out some of the work of the top teams for the netflix prize.