What is Pinterest's popular pins algorithm?

How do they determine whether a pin is popular or not? I had something like this in mind:
{reposts}*150+{comments}*100+{likes}*50-{time}/150

They probably won't tell anyone outside...
But in my experience, popularity algorithms start off like the one you suggested, then get refined and become more complex over time. Every similar algorithm I have developed so far kept evolving for months even after going live. As people try to trick the algorithm into making their own stuff more popular than other stuff, additional constraints are added to prevent this.
Most probably there will be more factors involved, e.g.
clicks
a score that depends on the user score of the users who took action (reposted, commented, ...), i.e. two comments from "VIP" users count for more than two comments from newbies
...
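As a rough illustration of the formula from the question combined with the "user score" idea above, here is a minimal sketch. The weights are the ones proposed in the question; the `user_score` multiplier, the `Action` type, and the assumption that `{time}` is the pin's age in seconds are all hypothetical. Nothing here reflects what Pinterest actually does.

```python
from dataclasses import dataclass

# Weights copied from the formula in the question (not Pinterest's real values).
WEIGHTS = {"repost": 150, "comment": 100, "like": 50}
TIME_DIVISOR = 150


@dataclass
class Action:
    kind: str          # "repost", "comment" or "like"
    user_score: float  # hypothetical reputation multiplier; 1.0 = ordinary user


def popularity(actions, age_seconds):
    """Weighted sum of actions, scaled by the actor's score, minus an age penalty."""
    engagement = sum(WEIGHTS[a.kind] * a.user_score for a in actions)
    return engagement - age_seconds / TIME_DIVISOR


# Two comments from "VIP" users outrank two comments from newbies.
newbie_pin = popularity([Action("comment", 1.0), Action("comment", 1.0)], 300)
vip_pin = popularity([Action("comment", 2.0), Action("comment", 2.0)], 300)
```

In practice the weights and the reputation multiplier would have to be tuned against real data, which is exactly the refinement process described above.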

Related

What Machine Learning algorithm would be appropriate?

I am working on a predictor for learning the most likely period for grape harvesting, depending on the weather and on the characteristics of the grapes, namely sugar level, pH, and acidity. I've got two datasets and I am thinking about how to merge them: one is the pre-harvest analysis data of some Italian vineyards for the 2003-2013 period, the other is the weather over that decade. What I want to do is learn from my samples when to harvest, given a range for the optimal sugar level, pH, and acidity, and given a weather forecast.
I thought that some Reinforcement Learning approach could work. Since the pre-harvest analyses are done about 5 times during the grape maturation period, I thought that those could be the states I step through, while the weather conditions could be the "probabilities" of going from one state to another.
Yet I am not sure which algorithm would be best, as every state and every "probability" depends on several variables. I was told that a Hidden Markov Model would work, but it seems to me that my problem doesn't fit the model perfectly.
Do you have any suggestions? Thanks in advance.
This has nothing to do with the actual algorithm, but the problem you are going to run into here is that weather is extremely local. One vineyard can have completely different weather than another only a mile away from it, believe it or not. If you put rain gauges at each vineyard, you will find this out. To get really good results you need to have a mini weather station at each vineyard. Absent this, your best option is to use only vineyards in the immediate vicinity of the weather measurements. For example, if your data is from an airport, only use vineyards right next to the airport.
Reinforcement learning is appropriate when you can control the action. It is like a monkey pushing buttons. You push a button and get shocked, so you don't push that button again. Here you have a passive data set and cannot conduct experimental actions, so reinforcement learning does not apply.
Here you have a complex set of uncontrolled inputs, the weather data, a controlled input (harvest time), and several output parameters, sugar etc. Given that data, you want to predict what harvest time to use for some future, unknown weather pattern.
In general, what you are doing is sensitivity analysis: trying to figure out how your factors affected the outcome that occurred. The tricky part is that the outcomes may be driven by some non-obvious pattern. For example, maybe 3 weeks of drought, followed by 2 weeks of heavy rain implies the best harvest will be 65 days hence, or something like that.
So, what you have to do is featurize the data to characterize it in plausible ways, then do a sensitivity analysis. If the analysis shows a strong correlation, then you have found a solution. If it does not, then you have to find a different way to featurize the data. For example, your featurization might be the number of days with rain over 2 inches, or the longest run of days without rain, or the total number of days with bright sunshine. Possibly multiple features might combine to make a solution. The options are limited only by your imagination.
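To make the featurize-then-correlate loop concrete, here is a minimal sketch in plain Python. The function names and the input layout (one rainfall reading in inches per day) are assumptions for illustration. A plain Pearson correlation stands in for the sensitivity analysis; a real study would likely use something more robust.

```python
from math import sqrt


def rainy_days(daily_rain_inches, threshold=2.0):
    """Number of days with rainfall strictly above the threshold."""
    return sum(1 for r in daily_rain_inches if r > threshold)


def longest_dry_spell(daily_rain_inches):
    """Length of the longest run of consecutive days with zero rain."""
    longest = current = 0
    for r in daily_rain_inches:
        current = current + 1 if r == 0 else 0
        longest = max(longest, current)
    return longest


def pearson(xs, ys):
    """Pearson correlation between a feature and an outcome (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

You would compute one feature value per season (for example `longest_dry_spell` for each year), collect the matching outcome per season (for example the harvest day that gave the best sugar), and check `pearson(features, outcomes)`. A value near 0 suggests trying a different featurization.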
Of course, as I was saying above, the fly in the ointment is that your weather data will only roughly approximate the real and actual weather at the particular vineyard, so there will be noise in the data, possibly so much noise as to make getting a good result impossible.
Why you actually don't care too much about the weather
Getting back to the data, having unreliable weather information is actually not a problem, because you actually don't care too much about the weather. The reason is two-fold. First of all, the question you are trying to answer is not when to harvest the grapes, it is whether to wait to harvest or not. The vintner can always measure the current sugar of the grapes. So, he just has to decide, "Should I harvest the grapes now with sugar X%, or should I wait and possibly get a better sugar Z% later?" To answer this question the real data you need is not the weather, it is a series of sugar/acidity readings taken over time. What you want to predict is whether, given a situation, the grapes will get better or whether they will get worse.
Secondly, grapevines have an optimal amount of moisture they like. If the vine gets too dry, that is bad, if it gets too wet that is bad. You cannot predict how moist a vine is from the weather. Some soils hold moisture well, others are sandy. A sandy vineyard will require more rain than a clay vineyard to have the same moisture levels. Also, the vintner can water his vineyards, completely invalidating the rainfall pattern. Therefore, weather is pretty much a non-factor.
I agree with Tyler that from a feasibility standpoint weather might harm your analysis. However, I think this is for you to test and find out! There could be some interesting data that comes out of it.
I'm not sure exactly what your test is, but a simple way to start is perhaps to turn this into a classification problem using an SVM (or even logistic regression, since you want probabilities) and use all the data as input to the algorithm, assuming you know which years were good harvest years. You could even test each variable individually and see how it affects your performance. I suggest you go this way if you can, simply because there are massive amounts of resources on the net, and people here on SO, that can help you tune your algorithm.
When you have a handle on this, I would try the HMM, as has already been suggested to you, since it will tell you which day was probably the best for the harvest. This is where the weather might hurt, but you'll come to understand more about your data from the simpler experiments.
The thing I've learned about machine learning is that while there are guidelines for when to choose which algorithm, it's not always set in stone, and you can change your question slightly and try a new approach to the problem, depending on how much freedom you have to play with the data. Good luck and have fun!

How does the computer know what to show under "Recommended for You"?

Recently, I noticed that several websites, for example YouTube or Facebook, have something like "Recommended for You": the site studies my usage behavior and recommends content to me. I would like to know how they analyze this information. Is there any Algorithm to do so? Thank you.
Amazon and Netflix (among others) use a technique called Collaborative filtering to suggest things you might like based on the likes/dislikes of others who have made purchases and selections similar to yours.
Is there any Algorithm to do so?
Yes
Yes. One fairly common one is to look at things you've selected in the past, find other people who've made those selections, then find the other selections most common among those other people, and guess that you're likely to be interested in those as well.
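The approach described above can be sketched in a few lines. This is only an illustration under assumed names: `selections` is a hypothetical mapping from user to the set of items they picked, and the "similarity" test is simply "shares at least one item with you".

```python
from collections import Counter


def recommend(target_user, selections, top_n=3):
    """Suggest items chosen most often by users who share a selection with target_user."""
    mine = selections.get(target_user, set())
    counts = Counter()
    for user, items in selections.items():
        if user == target_user or not (items & mine):
            continue  # skip yourself and users with nothing in common
        counts.update(items - mine)  # only count items you don't already have
    return [item for item, _ in counts.most_common(top_n)]


selections = {
    "ann": {"a", "b"},
    "bob": {"a", "c"},
    "cat": {"b", "c", "d"},
    "dan": {"x"},
}
suggestions = recommend("ann", selections)
```

Here `bob` and `cat` both overlap with `ann`, both picked `c`, and only `cat` picked `d`, so `c` ranks ahead of `d`. `dan` shares nothing with `ann` and is ignored.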
Yup there are lots of algorithms. Things such as k-nearest neighbor: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
Here is a pretty good book on the subject that covers making these sorts of systems along with others: http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=ianburriscom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325.
It's generally done by matching you with other users who have a similar usage history / profile and then recommending other things that they've purchased/watched/whatever.
Searching for "recommendation algorithm" yields lots of papers. Most algorithms incorporate "machine learning" algorithms to determine groups of things (comedy movies, books on gardening, orchestral music, etc.). Your matching with those groups yields recommendations. Some companies use humans to classify things, too.
Such an algorithm is going to vary wildly from company to company. In many cases, it analyzes some combination of your search history, purchase history, physical location, and other factors. It probably will also compare purchases/searches amongst other people to find what those people have purchased/searched for, and recommend some of those products to you.
There are probably hundreds of these algorithms out there, but I doubt you can use any of them (that are actually good). Probably you are better off figuring it out yourself.
If you can categorize your contents (i.e. by tagging or content analysis), you can also categorize your users and their preferences.
For example: you have a video portal with 5 million videos, 1 million of which are tagged mostly red. If 80% of all videos watched by a user (who is identified by an IP, a persistent user account, ...) are tagged mostly red, you might want to recommend even more red videos to him. You might want to refine your recommendations by looking at his further actions: does he like your recommendations -- if so, why not give him even more; if not, try the second-best guess, maybe he's not looking for color, but for the background music ...
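A minimal sketch of the "80% red" rule from the example, assuming each watched video contributes one tag to a list. The function name and the default threshold are hypothetical choices for illustration.

```python
from collections import Counter


def dominant_tag(watched_tags, threshold=0.8):
    """Return the tag covering at least `threshold` of the watched videos, else None."""
    if not watched_tags:
        return None
    tag, count = Counter(watched_tags).most_common(1)[0]
    return tag if count / len(watched_tags) >= threshold else None
```

If `dominant_tag` returns a tag, recommend more videos with that tag. If it returns `None`, fall back to another signal, such as the second-best guess mentioned above.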
There's no absolute algorithm to do it, but all implementations will go in a similar direction. It's always based on observing users, which scares me from time to time :-)
There's a whole lot of algorithms tackling the issue: Wiki article. It's a machine learning problem. Computers can be taught using two main techniques: classification and clustering. They require datasets as input. If the dataset is informative (really holds some useful patterns), then those ML techniques can extract most of it.
Clustering could be the best fit for this kind of problem. Its main use is to find similarities among the points in a given dataset. If the points are, e.g., your search history, they can be grouped together to form clusters. If your search history closely resembles another user's, a hint can be given by picking the links that are most similar to yours.
The same goes for book recommendations - it's obvious what dataset they use: "Other people who bought this product also bought Product A, Product B,...". The key here is to match your profile to others' and use the most similar ones to recommend.
The computer retrieves information from the human brain with complex memory scan process, sorts it accordingly and outputs results based on what you have experienced in your life so far.

Scheduling Students to Classes

I am making a website for a side project at school where students enter the classes they need to take, what days they want or don't want classes, and when they can't have or don't want classes. The basics are that there are classes, and each class has many sections at different times with different professors that a student can choose from. With the freshman level classes, there can be over 30 different sections for each class. I have the classes and sections in a MySQL database and I have been coding in PHP.
So far I have it working, but I want to make it faster. I have been reading about other scheduling problems but I am looking for specifics related to what I am doing. This isn't making schedules from scratch. It is making schedules from the sections that are available and ranking them based on what the student inputs. Currently, for a small number of possible sections, it runs fast. But when the possible schedules get to about 300,000, it takes around 30 seconds to compare and rank everything. I have been improving it by changing how schedules are generated, but I want it to be faster. I switched from brute force generation to a tree based method.
I'm not asking for homework help or for someone to do this for me. I just want to be pointed in the right direction with already existing problems and algorithms that I can learn about.
Remember the eight queens puzzle? I sure hope you do, if not, go and solve it first, then come back to your scheduling task.
You have already moved from brute force to a tree structure. Now it's time for branch and bound. Whatever you mean by "good schedules", 170000 is too much — you do not prune your tree enough. I do not think that there could be more than 20-50 really good schedules for each student, unless they take very few classes and are extremely flexible.
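To show what pruning buys you, here is a minimal branch-and-bound sketch. It is written in Python for brevity (the same structure ports directly to PHP), and the data layout is an assumption: each course is a list of `(section_id, time_slots, penalty)` tuples, where `penalty` is a hypothetical non-negative score for how badly the section violates the student's preferences.

```python
def best_schedule(courses):
    """Pick one section per course with no time clashes and minimal total penalty.

    courses: list of lists of (section_id, frozenset_of_slots, penalty >= 0).
    Returns (section_ids, total_penalty), or (None, inf) if no clash-free schedule exists.
    """
    best = {"penalty": float("inf"), "sections": None}

    def search(index, used_slots, chosen, penalty):
        # Bound: penalties never decrease, so this branch cannot beat the best so far.
        if penalty >= best["penalty"]:
            return
        if index == len(courses):
            best["penalty"] = penalty
            best["sections"] = list(chosen)
            return
        # Branch: try cheap sections first so a good bound is found early.
        for section_id, slots, cost in sorted(courses[index], key=lambda s: s[2]):
            if slots & used_slots:
                continue  # time clash, prune this subtree immediately
            chosen.append(section_id)
            search(index + 1, used_slots | slots, chosen, penalty + cost)
            chosen.pop()

    search(0, frozenset(), [], 0)
    return best["sections"], best["penalty"]


courses = [
    [("MATH-1", frozenset({"Mon9"}), 0), ("MATH-2", frozenset({"Tue9"}), 2)],
    [("ENG-1", frozenset({"Mon9"}), 0), ("ENG-2", frozenset({"Wed9"}), 1)],
]
result = best_schedule(courses)
```

This version keeps only the single best schedule. To rank the top few, you could keep a small sorted list of the best k results and prune against the worst penalty in that list instead.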
Try metaheuristics such as tabu search or simulated annealing.
Brute force and branch and bound don't scale up enough.
Take a look at my curriculum course example in Drools Planner, as defined by ITC2007.
It's probably an advanced form of your use case (not counting GUI/DB).
Have a look at this. It may not be exactly what you want but you can get some design ideas.

What should be considered when building a Recommendation Engine?

I've read the book Programming Collective Intelligence and found it fascinating. I'd recently heard about a challenge Amazon had posted to the world to come up with a better recommendation engine for their system.
The winner apparently produced the best algorithm by limiting the amount of information that was being fed to it.
As a first rule of thumb I guess... "More information is not necessarily better when it comes to fuzzy algorithms."
I know it's subjective, but ultimately it's a measurable thing (clicks in response to recommendations).
Since most of us are dealing with the web these days and search can be considered a form of recommendation... I suspect I'm not the only one who'd appreciate other people's ideas on this.
In a nutshell, "What is the best way to build a recommendation?"
You don't want to use "overall popularity" unless you have no information about the user. Instead, you want to align this user with similar users and weight accordingly.
This is exactly what Bayesian Inference does. In English, it means adjusting the overall probability you'll like something (the average rating) with ratings from other people who generally vote your way as well.
Another piece of advice, but this time ad hoc: I find that there are people where if they like something I will almost assuredly not like it. I don't know if this effect is real or imagined, but it might be fun to build in a kind of "negative effect" instead of just clumping people by similarity.
Finally there's a company specializing in exactly this called SenseArray. The owner (Ian Clarke of freenet fame) is very approachable. You can use my name if you call him up.
There is an entire research area in computer science devoted to this subject. I'd suggest reading some articles.
Agree with @Ricardo. This question is too broad, like asking "What's the best way to optimize a system?"
One common feature to nearly all existing recommendation engines is that making the final recommendation boils down to multiplying some number of matrices and vectors. For example multiply a matrix containing proximity weights between users by a vector of item ratings.
(Of course you have to be ready for most of your vectors to be super sparse!)
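A minimal sketch of that multiplication, using dictionaries so that missing entries cost nothing (which is how the sparsity is handled here). The names and the data layout are assumptions, and dividing by the sum of weights to get a weighted average is an added normalization, not something stated above.

```python
def predict_ratings(similarity_row, ratings):
    """Multiply one row of the user-proximity matrix by the sparse item ratings.

    similarity_row: {other_user: proximity weight to the target user}
    ratings:        {other_user: {item: rating}}, unrated items simply absent
    Returns {item: similarity-weighted average rating}.
    """
    totals, weight_sums = {}, {}
    for user, weight in similarity_row.items():
        for item, rating in ratings.get(user, {}).items():
            totals[item] = totals.get(item, 0.0) + weight * rating
            weight_sums[item] = weight_sums.get(item, 0.0) + weight
    return {
        item: totals[item] / weight_sums[item]
        for item in totals
        if weight_sums[item] > 0
    }


similarity = {"bob": 0.75, "cat": 0.25}
ratings = {"bob": {"film": 4.0, "book": 2.0}, "cat": {"film": 2.0}}
predicted = predict_ratings(similarity, ratings)
```

With real data volumes you would use a sparse matrix library rather than nested dictionaries, but the arithmetic is the same.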
My answer is surely too late for @Allain but for other users finding this question through search -- send me a PM and ask a more specific question and I will be sure to respond.
(I design recommendation engines professionally.)
@Lao Tzu, I agree with you.
According to me, recommendation engines are made up of:
Context Input fed from context aware systems (logging all your data)
Logical reasoning to filter the most obvious
Expert systems that improve your subjective data over the period of time based on context inputs, and
Probabilistic reasoning to do decision-making close-to-proximity based on a weighted sum of previous actions (beliefs, desires, & intentions).
P.S.
I built such a recommendation engine.

A/B testing on a news site to improve relevance

If you were running a news site that created a list of 10 top news stories, and you wanted to make tweaks to your algorithm and see if people liked the new top story mix better, how would you approach this?
Simple click logging in the DB associated with the post entry?
A/B testing where you would show one version of the algorithm to group A and another to group B and measure the clicks?
What sort of characteristics would you base your decision on as to whether the changes were better?
An A/B test seems like a good start; randomize the participants. You'll have to remember which group each participant is in so they never see both versions.
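One common way to "remember" the assignment without storing anything is to derive the group from a hash of the user ID. The sketch below assumes you have some stable `user_id` (account ID, long-lived cookie value); the experiment name is a hypothetical label used as a salt so that different experiments split users differently.

```python
import hashlib


def assign_group(user_id, experiment="top-stories-v2"):
    """Deterministically map a user to group "A" or "B" for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # The first byte of a SHA-256 digest is effectively uniform, so this is a ~50/50 split.
    return "A" if digest[0] % 2 == 0 else "B"
```

Because the result depends only on the inputs, the same user always lands in the same group across visits and servers.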
You could treat it like a behavioral psychology experiment, do a t-test, etc...
In addition to monitoring number of clicks, it might also be helpful to monitor how long they look at the story they clicked on. It's more complicated data, but provides another level of information. You would then not only be seeing if the stories you picked out grab the user's attentions, but also that the stories are able to keep it.
You could do statistical analysis (e.g. a t-test like Tim suggested), but you probably won't get a low enough standard deviation on either measure to prove significance. Although, it won't really matter: all you need is for one of the algorithms to have a higher average number of clicks and/or time spent. No need to fool around with hypothesis testing, hopefully.
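If you do want a quick check of whether a difference in averages is more than noise, the t statistic itself is cheap to compute. The sketch below uses Welch's variant (which does not assume equal variances) as one reasonable choice; the inputs are assumed to be per-user click counts or time-on-story values for each group, with at least two users per group.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for mean(sample_b) - mean(sample_a)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    if standard_error == 0:
        raise ValueError("both samples are constant; t is undefined")
    return (mean(sample_b) - mean(sample_a)) / standard_error
```

A value far from 0 (relative to the t distribution for your sample sizes) suggests the difference between the two algorithms is unlikely to be chance. A value near 0 means the observed gap could easily be noise.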
Of course, there is always the option of simply asking the user if the recommendations were relevant, but that may not be feasible for your situation.
