Programmatically determine the relative "popularities" of a list of items (books, songs, movies, etc) - algorithm

Given a list of (say) songs, what's the best way to determine their relative "popularity"?
My first thought is to use Google Trends. This list of songs:
Subterranean Homesick Blues
Empire State of Mind
California Gurls
produces the following Google Trends report: (to find out what's popular now, I restricted the report to the last 30 days)
http://s3.amazonaws.com/instagal/original/image001.png?1275516612
Empire State of Mind is marginally more popular than California Gurls, and Subterranean Homesick Blues is far less popular than either.
So this works pretty well, but what happens when your list is 100 or 1000 songs long? Google Trends only allows you to compare 5 terms at once, so absent a huge round-robin, what's the right approach?
Another option is to just do a Google Search for each song and see which has the most results, but this doesn't really measure the same thing

Excellent question - one song by Britney Spears, might be phenomenally popular for 2 months then (thankfully) forgotten, while another song by Elvis might have sustained popularity for 30 years. How do you quantitatively distinguish the two? We know we want to think that sustained popularity is more important than a "flash in the pan", but how to get this result?
First, I would normalize around the release date - Subterranean Homesick Blues might be unpopular now (not in my house, though), but normalizing back to 1965 might yield a different result.
Since most songs climb in popularity, level off, then decline, let's choose the area when they level off. One might assume that during that period, that the two series are stationary, uncorrelated, and normally distributed. Now you can just apply a test to determine if the means are different.
There's probably less restrictive tests to determine the magnitude of difference between two time series, but I haven't run across them yet.
Anyone?

You could search for the item on Twitter and see how many times it is mentioned. Or look it up on Amazon to see how many people have reviewed it and what rating they gave it. Both Twitter and Amazon have APIs.

There is an unoffical google trends api. See http://zoastertech.com/projects/googletrends/index.php?page=Getting+Started I have not used it but perhaps it is of some help.

I would certainly treat Google's API of "restricted".
In general, comparison functions used for sorting algorithms are very "binary":
input: 2 elements
output: true/false
Here you have:
input: 5 elements
output: relative weights of each element
Therefore you will only need a linear number of calls to the API (whereas sorting usually requires O(N log N) calls to comparison functions).
You will need exactly ceil( (N-1)/4 ) calls. That you can parallelize, though do read the user guide closely as for the number of requests you are authorized to submit.
Then, once all of them are "rated" you can have a simple sort in local.
Intuitively, in order to gather them properly you would:
Shuffle your list
Pop the 5 first elements
Call the API
Insert them sorted in the result (use insertion sort here)
Pick up the median
Pop the 4 first elements (or less if less are available)
Call the API with the median and those 4 first
Go Back to Insert until your run out of elements
If your list is 1000 songs long, that 250 calls to the API, nothing too scary.

Related

How to calculate difficulty metric?

Note: I have completely changed the original question!
I do have several texts, which consists of several words. Words are categorized into difficulty categories from 1 to 6, 1 being the easiest one and 6 the hardest (or from common to least common). However, obviously not all words can be put into these categories, because they are countless words in the english language.
Each category has twice as many words as the category before.
Level: 100 words in total (100 new)
Level: 200 words in total (100 new)
Level: 400 words in total (200 new)
Level: 800 words in total (400 new)
Level: 1600 words in total (800 new)
Level: 3200 words in total (1600 new)
When I use the term level 6 below, I mean introduced in level 6. So it is part of the 1600 new words and can't be found in the 1600 words up to level 5.
How would I rate the difficulty of an individual text? Compare these texts:
An easy one
would only consist of very basic vocabulary:
I drive a car.
Let's say these are 4 level 1 words.
A medium one
This old man is cretinous.
This is a very basic sentence which only comes with one difficult word.
A hard one
would have some advanced vocabulary in there too:
I steer a gas guzzler.
So how much more difficult is the second or third of the first one? Let's compare text 1 and text 3. I and a are still level 1 words, gas might be lvl 2, steer is 4 and guzzler is not even in the list. cretinous would be level 6.
How to calculate a difficulty of these texts, now that I've classified the vocabulary?
I hope it is more clear what I want to do now.
The problem you are trying to solve is how to quantify your qualitative data.
The search term "quantifying qualitative data" may help you.
There is no general all-purpose algorithm for this. The best way to do it will depend upon what you want to use the metric for, and what your ratings of each individual task mean for the project as a whole in terms of practical impact on the factors you are interested in.
For example if the hardest tasks are typically unsolvable, then as soon as a project involves a single type 6 task, then the project may become unsolvable, and your metric would need to reflect this.
You also need to find some way to address the missing data (unrated tasks). It's likely that a single numeric metric is not going to capture all the information you want about these projects.
Once you have understood what the metric will be used for, and how the task ratings relate to each other (linear increasing difficulty vs. categorical distinctions) then there are plenty of simple metrics that may codify this analysis.
For example, you may rate projects for risk based on a combination of the number of unknown tasks and the number of tasks with difficulty above a certain threshold. Alternatively you may rate projects for duration based on a weighted sum of task difficulty, using a default or estimated difficulty for unknown tasks.

n! combinations, how to find best one without killing computer?

I'll get straight to it. I'm working on an web or phone app that is responsible for scheduling. I want students to input courses they took, and I give them possible combinations of courses they should take that fits their requirements.
However, let's say there's 150 courses that fits their requirements and they're looking for 3 courses. That would be 150C3 combinations, right?.
Would it be feasible to run something like this in browser or a mobile device?
First of all you need a smarter algorithm which can prune the search tree. Also, if you are doing this for the same set of courses over and over again, doing the computation on the server would be better, and perhaps precomputing a feasible data structure can reduce the execution time of the queries. For example, you can create a tree where each sub-tree under a node contains nodes that are 'compatible'.
Sounds to me like you're viewing this completely wrong. At most institutions there are 1) curriculum requirements for graduation, and 2) prerequisites for many requirements and electives. This isn't a pure combinatorial problem, it's a dependency tree. For instance, if Course 201, Course 301, and Course 401 are all required for the student's major, higher numbers have the lower numbered ones as prereqs, and the student is a Junior, you should be strongly recommending that Course 201 be taken ASAP.
Yay, mathematics I think I can handle!
If there are 150 courses, and you have to choose 3, then the amount of possibilities are (150*149*148)/(3*2) (correction per jerry), which is certainly better than 150 factorial which is a whole lot more zeros ;)
Now, you really don't want to build an array that size, and you don't have to! All web languages have the idea of randomly choosing an element in an array, so you get an element in an array and request 3 random unique entries from it.
While the potential course combinations is very large, based on your post I see no reason to even attempt to calculate them. This task of random selection of k items from n-sized list is delightfully trivial even for old, slow devices!
Is there any particular reason you'd need to calculate all the potential course combinations, instead of just grab-bagging one random selection as a suggestion? If not, problem solved!
Option 1 (Time\Space costly): let the user on mobile phone browse the list of (150*149*148) possible choices, page by page, the processing is done at the server-side.
Option 2 (Simple): instead of the (150*149*148)-item decision tree, provide a 150-item bag, if he choose one item from the bag, remove it from the bag.
Option 3 (Complex): expand your decision tree (possible choices) using a dependency tree (parent course requires child courses) and the list of course already taken by the student, and his track\level.
As far as I know, most educational systems use the third option, which requires having a profile for the student.

How to calculate popularity based on some known factors

I've got a list of movies for each of which the following factors are known:
Number of people that wish to watch the movie in future
Number of people that have watched the movie
Number of people that have enjoyed the movie
Number of people that watched and disliked the movie
Number of comments on the movie
Number of page hits (directly or from search engines) for the movie page
So based on the above factors, I am looking for a way to calculate popularity for each of the movies. Is there any known formula or algorithm to calculate the popularity value in such case? Preferred algorithms are those which provide a more efficient way to update the previously calculated popularity value for each item.
There are basically infinite ways to do what you are after, depending on how important each factor is.
First, you will need to normalize the data. One way to do it is assume each feature is distributed normally, and find the standard deviation and mean of each feature. (your features are number of people watched the movie, number of people enjoyed the movie,...).
Once you have the sd (standard deviation) and mu (mean), you can easily transform the features for each movie to the standard form using norm = (value-mu)/sd.
The estimator for the mean (mu) is the simple average: sum(x_i) / n
The estimator for the standard deviation (sd) is sd = sqrt(Sum((x_i - mu)^2) / (n-1))
Once you have normalized your data, you can simply define the rating as a weighted sum, where each feature will get a boost according to how significant it is:
a1 * #watched + a2 * #liked + ....
If you don't know what the weight is, but willing to manually give grade to a set of movies, you might use supervised learning to find (a1,a2,...,an) for you using linear regression.
There is no correct answer, but I think we should try to model it as close to reality as possible.
Let's consider the following:
P1=Proportion of people who watched and enjoyed it
P2=Proportion of people who disliked the movie
P3=Proportion of people who watched and would like to see again
P4=People who will watch it later but haven't seen it yet
The number of comments simply can't tell how good a movie is, though it can tell how popular it is.Sure.You could leverage the amount of positive and negative comments if its possible to segregate so(possibly by up-votes and down-votes), or you could just use the number of comments as such(C).
Number of page hits should usually give a good indication of the popularity of the movie, so we should give it a good weight in our algorithm.Moreover we should give recent page hits more weight than say page hits of over a year ago.So try and keep the count of page hits in the last three days(N3), in the last week(N7), in the last month(N30) and in the last year(N365), and everything else(Nrest).
You come up with an algorithm using the factors I mentioned.
[Try to use weighted average and variations of Horner's rule for quick updates.Good Luck.]

How to notice unusual news activity

Suppose you were able keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer".
What are ways that could you tell whether the amount of mentions per entity per a given time period was unusual relative to their normal degree of frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of like 50% might be unusual (an increase of 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase of 2 to 200). If you didn't have a way of scaling that your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you could see if this change was a substantial change outside of their usual variance.
You could also try some normalization -- one very simple one would be that each category has a total number of mentions (m), a percent change from the last time period (δ), and then some normalized value (z) where z = m * δ. Lets look at the table below (m0 is the previous value of m) :
Name m m0 δ z
Steve Jobs 4950 4500 .10 495
Steve Ballmer 400 300 .33 132
Larry Ellison 50 10 4.0 400
Andy Nobody 50 40 .20 10
Here, a 400% change for unknown Larry Ellison results in a z value of 400, a 10% change for the much better known Steve Jobs is 495, and my spike of 20% is still a low 10. You could tweak this algorithm depending on what you feel are good weights, or use standard deviation or the rolling average to find if this is far away from their "expected" results.
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test if the current value is more than X standard deviations away from the historical data.
Some data will be more volatile than others so you may need to adjust X appropriately. X=1 is a reasonable starting point
Way over simplified-
store people's names and the amount of articles created in the past 24 hours with their name involved. Compare to historical data.
Real life-
If you're trying to dynamically pick out people's names, how would you go about doing that? Searching through articles how do you grab names? Once you grab a new name, do you search for all articles for him? How do you separate out Steve Jobs from Apple from Steve Jobs the new star running back that is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you actually insert. Every day at midnight, have your program run a quick google query for past 24 hours and store the number of results. There are a lot of variables in this though that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when an you send an email to someone, which is all well and good until the list gets really big you need to type more and more of an address to get to the one you want, which goes against the purpose of auto-complete
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is if there's any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
Other than just arbitrarily picked numbers does anyone have any input? Other numbers also welcome if you can give a reason why you think they're better than mine
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, Such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
EDIT: When an address is selected, add one to its frequency, and move to the front of the group of nodes with the same weight (or (weight div x) for courser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. Which means, that you have to have the entire email history available to you when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed -- This gives us our statistical good performance.
This kind of thing seems similar to what is done by firefox when hinting what is the site you are typing for.
Unfortunately I don't know exactly how firefox does it, point system seems good as well, maybe you'll need to balance your points :)
I'd go for something similar to:
NoM = Number of Mail
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write during the last month (it could be changed) will have 0 points. You could start sorting them for NoM sent in total (since it is on the contact list :). These will be showed after contacts with points > 0
It's just an idea, anyway it is to give different importance to the most and just mailed contacts.
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" work very well. The "most recent" can be a real pain if you accidentally mis-type something once. Alternatively, "most used" doesn't evolve much over time, if you had a lot of contact with somebody last year, but now your job has changed, for example.
Once you have the set of measurements you want to use, you could create an interactive apoplication to test out different weights, and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache, when lambda is 1 it performs exactly like an LRU cache. In between 0 and 1 it combines both recency and frequency information in a natural way.
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter each use, but by some larger-than-one value, like 10 (To add precision to the second point).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo#bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency to one field, avoid the need for keeping a detailed history to derive {last day, last week, last month} and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.

Resources