Is an algorithm to judge the age of a person in a photo feasible?

My friend works for a non-profit organization working to stop the illegal exploitation of minors over sites such as craigslist.org, which is one of the more popular mediums. The question is whether or not it is possible, now or in the near future, to develop an algorithm to analyze a photo of a person and return a prediction of their relative age.
It sounds like a mammoth task. My only thought was some sort of Bayesian probability system. I know that even people often have trouble judging someone's age, but Bayesian spam filters are advertised as being "10 times as accurate as a human", so maybe it's possible?
I am pretty inexperienced though. I would appreciate it if someone else could suggest whether or not this is feasible and if so how and when?
EDIT: Thank you everyone for the responses. Smoore, that study was very helpful, but I think Hal's solution is the most practical for the time being.

Here's a possible (left-field) solution. Perhaps you could tie it into some type of captcha for the site itself: prompt new users with images of other new users and the question "Is this person over 18?". True, a 50% success rate is not a very effective captcha system, but it's a start.
Coupled with other checks, or repeated checks, it could work. You could display the image to a number of new users and base the result on a certain threshold: if 8 out of 10 people flag a certain image as not a minor, then it's probably pretty safe to say they are of age.
But, this whole system can be circumvented by simply uploading someone else's image so I'm not sure how effective any of this really is. :)
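A minimal sketch of that threshold logic (the function name, the 80% threshold, and the minimum vote count are all illustrative, not a real system):

    # Crowd-vote threshold: None means "not enough reviews yet".
    def looks_of_age(votes, threshold=0.8, min_votes=10):
        """votes: list of booleans, True = reviewer flagged 'over 18'."""
        if len(votes) < min_votes:
            return None
        return sum(votes) / len(votes) >= threshold

    print(looks_of_age([True] * 8 + [False] * 2))  # True: 8 of 10 say over 18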

I expect it would be pretty hard to get right. Consider this set of photos where the same model is made up to look very different ages.

There are algorithms that reliably determine the attractiveness of a face. See acm.org and uni-regensburg.de. It wouldn't be too much of a stretch to imagine an algorithm that could predict age.
Characteristics such as skin smoothness would probably correlate strongly with age. It would probably take a great deal of effort to be more reliable than your average carny, though.

I think you would need some input from a forensic anthropologist (or at least an anatomist).
Different parts of the body grow at different rates, so it might be possible to do something like size of head vs. shoulder width, or arm length vs. body width.
Unfortunately it sounds like he is trying to differentiate between, say, 14-year-olds and 18-year-olds, which is only a four-year difference. Variations in genetic makeup and nutrition would probably give any system an accuracy of +/- 20%, which equates to about three years for this age group.
On the other hand, if you had a large sample of photos, you could account for the variance statistically and get a pretty good idea of whether a site was likely to be exploiting minors systematically.

The direct answer to your question is no: no such algorithm will exist in the near future, and it is probably impossible to achieve with any accuracy without strong AI.
That said, a practical solution to your problem is probably Amazon Mechanical Turk:
http://mturk.com
There, you can pay a small fee to have real people complete a task for you. I'd probably set your task up so that you pay $0.02 to have a person estimate the ages of maybe 5 faces at a time. You could double- or triple-check your results with other workers, particularly for faces that seem close to your age limit. This is probably your only practical solution other than hiring minimum-wage interns to manually review all submissions.
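For what it's worth, here's a sketch of what posting that task could look like with today's boto3 MTurk client (which postdates this answer); the external form URL is hypothetical, and the reward/assignment numbers just mirror the suggestion above:

    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    # ExternalQuestion: workers fill in an age-estimation form hosted by you.
    question_xml = """
    <ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.com/age-check?batch=123</ExternalURL>
      <FrameHeight>400</FrameHeight>
    </ExternalQuestion>"""

    hit = mturk.create_hit(
        Title="Estimate the ages of 5 people from photos",
        Description="Look at 5 faces and give your best guess of each person's age.",
        Reward="0.02",                    # $0.02 per assignment, as suggested above
        MaxAssignments=3,                 # triple-check each batch of faces
        AssignmentDurationInSeconds=300,
        LifetimeInSeconds=86400,
        Question=question_xml,
    )
    print(hit["HIT"]["HITId"])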

Use mechanical turk

In this study they tried it by analysing facial geometry and wrinkle features. The problem is that these would be affected by shot angle, lighting, etc.

In some theoretical sense it is probably possible. For all practical purposes though, it is currently impossible.

Mammoth is an understatement I think. "Giant glacier" or "moon" might be more appropriate.
This isn't to say it wouldn't be worth looking into but I have a feeling you'd be in for a lot of man hours before you came up with something remotely useful.

I don't think it's something a computer could do with any degree of accuracy; it's really hard even for people. I mean, have you been to the liquor store lately? They are supposed to ask for ID from anybody who looks under 25 (the drinking age is 19 here), and apparently some 40-year-olds don't look old enough. Telling somebody's age just by looking at them is a very hard thing to do, especially when you get into the erotic-picture arena, where they are trying to make models seem younger than they really are.

I think you will also have difficulties with differently composed pictures: angles on a face, different lighting, context, and probably most of all image quality/resolution. It's a lot easier to work with an 800x600 picture than with a 320x240 one. The algorithm is only as good as its input.
I cannot see this approach (a software solution to measuring age) being very effective. I like the idea of users flagging images; a human being can discern age many times more effectively than any algorithm.

Practical approach aside, I'd advise against trying to develop anything in that direction for now.
A few reasons:
1. guessing someone's age is a thankless task
2. "biological" age and "calendar" age vary greatly from person to person - I know people who are 30 and still get asked for ID when buying liquor, and some who are barely 18 and already look over 30
3. some people's looks don't change over time - they just have that kind of face
4. nowadays everyone is working to look as young as they can, so basically you've got a whole industry working against you :(
Anyway, to cut a long story short, I don't think it's feasible for now.

A neural net is a reasonable approach, you would need a training set of pictures of people with known ages and a bit of image processing to remove hats etc.
edit: Question changed?
You might be able to classify someone as 20-30 or 40-50 on a CCTV image, but you aren't going to be able to tell whether a model is 17 or 18 in a posed photo.
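To make the training-set idea concrete, here is a minimal PyTorch sketch of regressing age from face crops. The random tensors are stand-ins for a real labeled dataset, and no claim is made that this reaches useful accuracy:

    import torch
    import torch.nn as nn

    # Tiny CNN: 64x64 RGB face crop in, predicted age (years) out.
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
        nn.Linear(64, 1),
    )

    faces = torch.randn(8, 3, 64, 64)   # stand-in for preprocessed face crops
    ages = torch.tensor([[23.], [17.], [41.], [19.], [35.], [16.], [28.], [52.]])

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(10):                 # real training needs far more data/epochs
        opt.zero_grad()
        loss = loss_fn(model(faces), ages)
        loss.backward()
        opt.step()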

Like nearly all advanced tasks in image classification, this topic is still under research. Judging from this paper it is possible but non-trivial, and you need a lot of (manually) annotated training data. Without any knowledge of this field and no experience in image processing, this task is going to take you several months.

Develop a classification algorithm that bases a heuristic on many values of the picture: the number of dark pixels within the face area (possibly wrinkles), the color of the hair, and so on. These values should fall within a predictable range for any profile-style picture. If you want to be fancy, attach weights to these values and build a search tree that can quickly scan hundreds of thousands of images, finding where a given image "falls" within an age-specific set of values.
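As a rough illustration of the "dark pixels in the face area" idea, a sketch using OpenCV's bundled face cascade with edge density as a crude wrinkle proxy (the Canny thresholds are made up, and nobody should mistake this for a validated age feature):

    import cv2

    def wrinkle_score(image_path):
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        faces = cascade.detectMultiScale(gray, 1.3, 5)
        if len(faces) == 0:
            return None                      # no face found
        x, y, w, h = faces[0]
        edges = cv2.Canny(gray[y:y + h, x:x + w], 100, 200)
        return edges.mean() / 255.0          # fraction of "wrinkly" pixels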

Some Japanese cigarette vending machines do this. Not terribly well by all accounts, but then it probably doesn't matter since, as Hal mentioned, the easiest hack is just to use someone else's image...

Nothing is impossible; only the amount of effort changes.
I think it would be nearly impossible if you targeted just one particular feature of the face. You have to consider multiple factors, so the decision lies in a matrix: feed in multiple features and combine them to get your answer. Some features I would consider:
1) Beard (detect the face, then detect a beard on it - helpful in distinguishing male/female/children)
2) Hair
3) Wrinkles
4) Size of the face
5) Ratio between height and breadth of the face
It would be a tough assignment, but such an algorithm can be developed.

As of now, this is possible with 90% accuracy. Yes; please refer to the following link:
http://www.omron.com/r_d/coretech/vision/okao.html

Related

What Machine Learning algorithm would be appropriate?

I am working on a predictor for learning the most likely period for grape harvesting, depending on the weather and on the characteristics of the grapes, namely sugar level, pH, and acidity. I've got two datasets and I am thinking about how to merge them: one is the pre-harvest analysis data of some Italian vineyards in the 2003-2013 period, the other is the weather over that decade. What I want to do is learn from my samples when to harvest, given a range for the optimal sugar level, pH, and acidity, and given a weather forecast.
I thought that some Reinforcement Learning approach could work. Since the pre-harvest analysis are done about 5 times during the grape maturation period, I thought that those could be states I step in, while the weather conditions could be the "probabilities" of going from a state to another.
Yet I am not sure of what algorithm would be the best as every state and every "probability" depends on several variables. I was told that Hidden Markov Model would work, but it seems to me that my problem doesn't fit the model perfectly.
Do you have any suggestion? Thx in advance
This has nothing to do with the actual algorithm, but the problem you are going to run into here is that weather is extremely local. One vineyard can have completely different weather from another only a mile away, believe it or not. If you put rain gauges at each vineyard, you will find this out. To get really good results you would need a mini weather station at each vineyard. Absent that, your best option is to use only vineyards in the immediate vicinity of the weather measurements. For example, if your data is from an airport, only use vineyards right next to the airport.
Reinforcement learning is appropriate when you can control the action. It is like a monkey pushing buttons. You push a button and get shocked, so you don't push that button again. Here you have a passive data set and cannot conduct experimental actions, so reinforcement learning does not apply.
Here you have a complex set of uncontrolled inputs, the weather data, a controlled input (harvest time), and several output parameters, sugar etc. Given that data, you want to predict what harvest time to use for some future, unknown weather pattern.
In general, what you are doing is sensitivity analysis: trying to figure out how your factors affected the outcome that occurred. The tricky part is that the outcomes may be driven by some non-obvious pattern. For example, maybe 3 weeks of drought, followed by 2 weeks of heavy rain implies the best harvest will be 65 days hence, or something like that.
So, what you have to do is featurize the data to characterize it in possible likely ways, then do a sensitivity analysis. If the analysis has a strong correlation, then you have found a solution. If it does not, then you have to find a different way to featurize the data. For example, your featurization might be number of days with rain over 2 inches, or it might be most number of days without rain, or it might be total number of days with bright sunshine. Possibly multiple features might combine to make a solution. The options are limited only by your imagination.
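A hypothetical sketch of that featurization step with pandas; the file and column names are invented, and the point is just to derive candidate features per season and check which correlate with the outcome:

    import pandas as pd

    daily = pd.read_csv("weather_daily.csv")   # hypothetical: date, rain_in, sun_hours
    seasons = pd.read_csv("harvests.csv")      # hypothetical: year, sugar_pct

    def featurize(g):
        return pd.Series({
            "days_rain_over_2in": (g.rain_in > 2).sum(),
            "longest_dry_spell": (g.rain_in == 0).astype(int)
                                  .groupby((g.rain_in > 0).cumsum()).sum().max(),
            "bright_days": (g.sun_hours > 8).sum(),
        })

    feats = daily.groupby(daily.date.str[:4].astype(int)).apply(featurize)
    merged = feats.join(seasons.set_index("year"))
    print(merged.corr()["sugar_pct"])          # crude sensitivity check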
Of course, as I was saying above, the fly in the ointment is that your weather data will only roughly approximate the real and actual weather at the particular vineyard, so there will be noise in the data, possibly so much noise as to make getting a good result impossible.
Why you actually don't care too much about the weather
Getting back to the data, having unreliable weather information is actually not a problem, because you actually don't care too much about the weather. The reason is twofold. First of all, the question you are trying to answer is not when to harvest the grapes, it is whether to wait to harvest or not. The vintner can always measure the current sugar of the grapes. So, he just has to decide, "Should I harvest the grapes now with sugar X%, or should I wait and possibly get a better sugar Z% later?" To answer this question, the real data you need is not the weather, it is a series of sugar/acidity readings taken over time. What you want to predict is whether, given a situation, the grapes will get better or whether they will get worse.
Secondly, grapevines have an optimal amount of moisture they like. If the vine gets too dry, that is bad, if it gets too wet that is bad. You cannot predict how moist a vine is from the weather. Some soils hold moisture well, others are sandy. A sandy vineyard will require more rain than a clay vineyard to have the same moisture levels. Also, the vintner can water his vineyards, completely invalidating the rainfall pattern. Therefore, weather is pretty much a non-factor.
I agree with Tyler that, from a feasibility standpoint, the weather data might harm your analysis. However, I think this is for you to test and find out! There could be some interesting data that comes out of it.
I'm not sure exactly what your test is, but a simple way to start is to turn this into a classification problem using an SVM (or even logistic regression, since you want probabilities) and use all the data as input for the algorithm, assuming you know which years were good harvest years and which were not. You could even test each variable individually and see how it affects your performance. I suggest going this way if you can, just because there are massive amounts of sources on the net, and people here on SO who can help you tune your algorithm.
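A quick sketch of that framing with scikit-learn; the feature matrix and the good/bad-year labels below are random stand-ins for the real vineyard data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(11, 6))                     # 11 seasons x 6 features (stand-in)
    y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1])  # 1 = good harvest year (stand-in)

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=3).mean())   # rough performance check
    clf.fit(X, y)
    print(clf.predict_proba(X[:1]))                  # [P(bad year), P(good year)]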
When you have a handle on this I would, as has been suggested to you before, try the HMM, as it will tell you which day was probably the best for the harvest. This is where the weather might hurt, but you'll come to understand more about your data from the simpler experiments.
The thing I've learned about machine learning is that while there are guidelines for choosing an algorithm, it's not set in stone; you can change your question slightly and try a new approach to the problem, depending on how much freedom you have to play with the data. Good luck and have fun!

Estimating a project with many unknowns

I'm working on a project with many unknowns, like moving an app from one platform to another.
My original estimates are way off, and there is no way I can know for sure when this will end.
How can I deal with the inability to estimate such a project? It's not that I'm adding a button to a screen, designing a web site, creating an app, or even fixing bugs. These are not methods with bugs; these are assumptions baked into the overall code which are no longer correct, and they are found step by step, each one analyzed and mitigated with many more unknowns.
I happened to write a master's thesis about software estimation, and there are lessons I've learned:
-1st count, 2nd compute, 3rd judge: first try to identify items in your work which are countable, e.g. files, classes, LOCs, UIs, etc. Then use this data to calculate the effort (in person-days). Use judgement only as a last resort.
-Document your estimation! Show numbers. This minimizes your risk, since you present the results not as your opinion but as more or less objective figures. (In general: the more paper, the cleaner the backside.)
-An estimation is not a commitment. A commitment is one number; an estimation is always a range, so give your estimation as a range (use the cone of uncertainty to select the range properly: http://www.construx.com/Page.aspx?hid=1648).
-Divide: use a WBS and divide your work into small pieces, then estimate them separately. The granularity depends on the total length, but a single work package shouldn't be bigger than 10% of the entire effort.
-Estimate effort first, then schedule, then costs.
-Treat estimation as support for planning, and re-estimate at each project phase (see the cone of uncertainty).
I would suggest the book http://www.stevemcconnell.com/est.htm, which deals with all these points, in particular how to deal with bosses who try to pull a commitment out of you.
Regards,
Valentin Heinitz
There's no right answer for coming up with an accurate estimate, because there's no way to know it.
As for estimating the work itself, think about how each step can be divided into separate sub-steps, and break those down even smaller, until you can get a fair picture of as much of the work as you can, in chunks small and discrete enough to estimate soundly. If you can, come up with both an expected time and a worst-case time, to get a range of where you could land.
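One common way to roll up such ranged sub-estimates is a three-point (PERT-style) calculation; the answer doesn't name it, but it fits the expected/worst-case advice, and the tasks and numbers here are invented:

    # (optimistic, most likely, pessimistic) estimates in days per sub-task
    tasks = {
        "port data layer":          (3, 5, 12),
        "replace UI toolkit":       (5, 10, 25),
        "fix platform assumptions": (2, 8, 30),
    }

    expected = sum((o + 4 * m + p) / 6 for o, m, p in tasks.values())
    worst = sum(p for _, _, p in tasks.values())
    print(f"expected ~{expected:.0f} days, worst case {worst} days")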
Another way to approach this is to ignore the old system. It sounds like a headache. Make an estimate of scrapping the old system and implementing a new one from scratch, or integrating a third-party, off-the-shelf solution. If there's a case to be made for this, it's worth at least investigating.
Sounds like a post for postsecret not SO. :)
I would tell him that it will be done when it's done, and if that's not good enough, he can learn to program and help you. Then again, I think you might get fired, but hey, that might be better.
Tell him more or less what you told us: the project is too volatile to give an accurate estimate, and the best you can do is give an estimate for a given task. As long as the number of tasks is unknown, so is the estimate. If he is at all worth his salary, he would rather hear this than some made-up number. This is not uncommon when dealing with a large legacy code base.
It's not that I'm adding a button to a screen or designing a web site,
or creating and app or even fixing bugs.
That is a real problem. You cannot estimate what you have no experience in. The only thing you can do is pad your estimate until you think it is a reasonable amount of time: the more unknowns you think there are, and the less you know about them, the more you pad.
I read the book below and it spoke at length about accuracy vs. precision. Basically, you can be accurate but have a very large range: for instance, you can be certain a task will take between 1 day and 1 year to complete. That is not very precise, but it is accurate.
Software Estimation: Demystifying the Black Art
Some tips for estimating

Are there any well-known algorithms or computer models that computer scientists use to predict FIFA World Cup winners?

Occasionally I read news articles mentioning computer models that computer scientists use to predict the winners of sporting events, or the odds for betting; I figure there must be a mathematical model behind them. I never bothered to think twice, even though I am a "pseudo computer scientist" myself. With the 2010 FIFA World Cup just underway, and since I am also a "pseudo football/soccer player", I just started to wonder about these algorithms.
For example, I know one factor is the strength of the opponents, so that a win against a strong opponent counts for more than a win against a weak one. But that leads to a circular loop: how does one determine the strength of a team in the first place, before that team can be considered strong or weak? If it's based on historical data, there's no way that could be accurate, because the players of the past are no longer on the field, so their impact is gone (except maybe if they become coaches, like Maradona).
Anyway, long question short: if you happen to be working in this field or have some knowledge of it, please shed some light.
I know of some work, but its basis might surprise you a bit: it's been used to predict (quite accurately) how well various countries do in the Olympics, and it's based purely on the economics of the countries in question, without looking at the individual athletes at all. I don't believe it's been used specifically for the FIFA World Cup, but I suspect it would apply about as well, or maybe even a bit better.
Some of the large investment banks started a competition for their quants to write models to predict the World Cup winner:
http://kaggle.com/worldcup2010
More info on the models
http://kaggle.com/blog/2010/06/03/predicting-the-2010-fifa-world-cup-can-statisticians-outdo-the-investment-banks/
There's been some modeling to select horse racing winners using logit models here. The general principle can be applied to predicting which teams advance to the Round of 16 and subsequent rounds. Horse racing is at least as complex, if not more so, with regard to the number of variables that have a statistically significant effect on the outcome. For instance, in the authors' model, weight, win rate, jockey characteristics, speed, post position, distance, and winnings were all significant variables. The authors didn't have access to "trip handicapping" at the time, which has since proven to be another important effect.
Reading this paper might help generate some thoughts around handicapping FIFA.
To tanascius's point, developing a predictive model is the first step. As the authors further explain, developing a betting strategy based on the results is a different problem, based in part on the accuracy of the model.
One guy has been using Google's PageRank for sports. Not sure why he felt the need to rename it:
http://www.physorg.com/news180094320.html
I found that link by a quick search for using PageRank for sports, because I realized it solves the circular references in rankings. I was curious whether anyone had tried it, and there it was.
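To illustrate how PageRank breaks that circularity, here's a toy sketch where each loss "links" the loser to the winner and the stationary distribution ranks the teams; the teams and results are invented:

    import numpy as np

    teams = ["BRA", "GER", "ESP", "NED"]
    # losses[i][j] = 1 means team i lost to team j
    losses = np.array([[0, 0, 1, 0],
                       [1, 0, 0, 0],
                       [0, 0, 0, 1],
                       [1, 1, 0, 0]], dtype=float)

    # Row-normalize; an unbeaten team would get a uniform row (dangling node).
    rows = losses.sum(axis=1, keepdims=True)
    P = np.where(rows > 0, losses / np.where(rows == 0, 1, rows), 1 / len(teams))

    rank = np.full(len(teams), 1 / len(teams))
    for _ in range(100):                   # power iteration with damping
        rank = 0.15 / len(teams) + 0.85 * rank @ P
    print(sorted(zip(rank, teams), reverse=True))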
BTW, anyone who can make accurate predictions for things you can bet on is not publishing their methods or results. They should be making money instead.
I think there's too much to take into consideration:
Injured players, form players, form teams, pressure on teams, rivalries, weather, home advantage, past meetings, formations, team styles, expectations....
Dunno if this applies to the FIFA soccer video games, but I know that for the Super Bowl (American football, for those who don't know) they use the latest version of Madden to predict the winner.
Not very scientific, I suppose, but it's there.

Predictive "blood glucose" algorithm?

I'm writing an app that lets a diabetic user enter his/her "blood glucose" readings, and then charts them on a graph over time from left to right. Since the blood readings will be done only several times a day, an algorithm would be handy to:
a) fill in the gaps on the graph between readings (curves would be more realistic than jerky lines) and allow a more accurate "blood glucose level" daily average
b) roughly predict what will happen in the future (if the user eats nothing that will affect his blood levels)
I suck at calculus. I'm hoping someone here knows of a library for this stuff, or of an algorithm that has been tailored to this specific problem already (e.g. one that someone has compared to real data from diabetics).
Disclaimer: I am very aware that any such algorithm would vary wildly depending on the user. I'm just looking to improve on straight angular lines. Regardless of the diabetic, there is a limit to the rate that blood sugars can rise and fall.
I'm using Javascript, but as it's just math, I could port it from C, Java or whatever.
Blood sugar behavior is very complicated. It is affected by
Current blood sugar (complicated by the possible presence of ketones if the patient is hyperglycemic)
recent food out to several hours depending on the type and how much
recent fast acting insulin (with variety and patient dependent reaction profiles between 45 minutes and two hours long. Oh, and delivery mechanism)
long-acting insulin out past 12 hours (again patient and variety dependent)
activity levels
stress levels
illness
basal insulin rate if the patient wears a pump
ad nauseam
Very hard problem. Any heuristic, any heuristic at all, you choose would be highly misleading. So, short answer:
Don't do it.
This comes, in part, from having compared a diabetic's 24-hour continuous glucose log with the ~10 finger pricks taken during the same period. I.e., my suggestion is data-driven.
Edit: Evidently I didn't make myself clear.
You can't even get close.
Nothing you can do with finger prick data can be remotely reliable.
Connecting the dots with any lines (even straight segments) is just plain wrong. It doesn't reflect reality. Not even a little bit.
I'm an experimental particle physicist. Complicated data sets are what I do. There is a diabetic in my life (did you guess?). This matters to me.
But I've seen the high-frequency data logs, side by side with a log of the day's finger pricks, exercise, food, and insulin.
If you could get every-fifteen-minutes data, I'd say go ahead and use a spline. It won't be dangerously misleading. But, if you have 6-10 measurements across the day, you know nothing.
Good news: continuous monitoring is coming down in price. It's out of the lab and available with some pumps even now.
For those who aren't familiar with this: compliant diabetic patients do (results of extremely unscientific polling) 4-6+ glucose tests a day as a matter of course, and several additional ones in the 1-2 hours following any unexpected excursion (they get physical symptoms that allow them to detect severe excursions).
This serves to give the patient a rough idea of how they are doing at controlling their glucose levels, but they also go to a lab to get a Hemoglobin A1C drawn every quarter (or so). The A1C result is dependent mostly on their average blood glucose.
I've talked to people who clocked in at 80-110 (quite favorable numbers) four times a day for months, and got back an A1C suggesting an average above 150 (not desirable at all). Presumably they were going high in the night. And I've heard similar stories from people who were probably going low, very low, in their sleep.
The lesson is:
Finger prick readings have their place, but don't try to extrapolate them to times not well sampled.
If you want to do just a straight fit of the data to make things easier to view, then something like what Charlie Martin recommended would likely work well. However, as noted by dmckee, this data really wouldn't mean anything.
What you are trying to do is actually more in line with pharmacokinetics, which is an entire scientific field in and of itself. In this case I'm not even sure it would apply except for Type 1 diabetes, as most of what I know about pharmacokinetics applies to drug studies; if something is being produced by the body itself, you are likely looking at entirely different types of analysis. If you are interested in the subject, there are quite a few book previews on Google Books if you search for "pharmacokinetics", but due to the nature of the subject they are very math-heavy and assume an understanding of chemistry and biology as well.
Okay, you're going to be looking for some fitted curve. The thing with that is that for n points there is an exact-fit polynomial of order n-1. It's been a while... yep, by golly, I'm right. The common thing when you have lots of points and don't want a complicated function (which you don't) is to use a least-squares approximation.
Probably the best thing is to look for a canned routine you can use; these exist in most stats packages. Give us a little more detail on the environment you want and we might be able to point you more closely to something suitable.
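For example, a minimal least-squares sketch with numpy, strictly for display smoothing (per the caveats elsewhere on this page, don't read physiology into it; the readings are invented, and since it's plain math it ports directly to JavaScript):

    import numpy as np

    hours = np.array([7.0, 9.5, 12.0, 15.0, 18.5, 22.0])  # reading times
    mgdl = np.array([110, 160, 95, 140, 180, 120])         # glucose readings

    coeffs = np.polyfit(hours, mgdl, deg=3)   # low-order least-squares fit
    curve = np.poly1d(coeffs)
    grid = np.linspace(hours.min(), hours.max(), 50)
    print(list(zip(grid, curve(grid)))[:3])   # smooth points for charting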
This is most likely not going to work, but artificial neural networks may, and I repeat may, be able to get something out of a good data set. By good, I mean weeks or months of continuous recording, and even then I wouldn't trust the data set unless I had very good reason to. I also don't think you'll get predictive data out of it, though that may depend on how you implement it. Overall, this would be more of a hobby project, to see if it can even come close ("oh neat, I got a neural network to within X amount of accuracy"). Again, I must stress: don't use this in any sort of production situation, or anywhere it could possibly hurt or kill someone!

How to design an approximate solution algorithm

I want to write an algorithm that can take parts of a picture and match them to another picture of the same object.
For example, If I gave the computer a picture of a vase and a picture of a scene with the vase in it, I'd expect it to determine where in the image the vase is.
How would I begin to develop an algorithm like this?
The final usage for this algorithm will be an application that, given a picture of somebody's face, could tell if they were in a crowd of people. This algorithm would eventually be applied to video streams.
edit: I'm not expecting an actual solution to this problem, as I don't hope to solve it anytime soon. The real question is how you would describe something like this to a computer so that you could make an algorithm to do it.
Thanks
A former teacher of mine wrote his doctoral thesis on a similar sort of problem, except his input was a detailed 3D model of something, which he would use to find that object in 2D images. This is a VERY non-trivial problem; there is no single 'answer', and certainly nothing that would fit the Stack Overflow format.
My best answer: gather a ton of money and hire a very experienced programmer.
Best of luck to you.
The two problems you describe are actually quite different.
A major part of each is solved by the numerous machine vision libraries available. You may need a combination of techniques to achieve any success at either task.
For the first one, you would need something that generically recognizes objects. I'd probably use a number of algorithms in concert to identify the foreground object in the model image and then do some kind of weighted comparison against a partitioned target image.
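For the vase-in-a-scene case specifically, the crudest starting point is plain template matching in OpenCV; a sketch (file names hypothetical - note this only works when scale and angle roughly match, which is why real solutions use feature-based methods like SIFT/ORB):

    import cv2

    scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
    vase = cv2.imread("vase.jpg", cv2.IMREAD_GRAYSCALE)

    # Slide the template over the scene and record normalized correlation.
    result = cv2.matchTemplate(scene, vase, cv2.TM_CCOEFF_NORMED)
    _, score, _, top_left = cv2.minMaxLoc(result)
    h, w = vase.shape
    print(f"best match {score:.2f} at {top_left}, box {w}x{h}")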
The second case, examining faces, is a much more difficult problem than the general recognizer above. Faces all look the same, or nearly so; the things a general recognizer would notice aren't likely to be good for differentiating faces. You need an algorithm already tuned to facial recognition. Fortunately this is a rapidly maturing field, so you can probably do this as well as the first case, just with a different set of functions.
The simple answer is: find a mathematical way to describe faces that can account for angles and partially missing data, then refine and train it.
Apparently Apple has done something like this; however, it still makes mistakes and has to be taught as it goes.
I expect it will be more about the math than about the programming.
I think you will find this to be quite a challenge. This is an extremely difficult problem and is one of the many areas of computing that fall under the domain of artificial intelligence (AI). Facial recognition would certainly be the most popular variant of this problem and, in spite of what you may read in the media, any claimed successes are not what they are made out to be. I think the closest solutions involve neural nets, and they usually require very clear and carefully selected images.
You could try reading here though. Good luck!
