What is a good algorithm for showing the most popular blog posts? - algorithm

I'm planning on developing my own plugin for showing the most popular posts, as well as counting how many times a post has been read.
But I need a good algorithm for figuring out the most popular blog post, and a way of counting the number of times a post has been viewed.
A problem I see when it comes to counting the number of times a post has been read, is to avoid counting if the same person opens the same post many times in a row, as well as avoiding web crawlers.

http://wordpress.org/extend/plugins/wordpress-popular-posts/
Comes in the form of a plugin. No muss, no fuss.

'Live' counters are easily implementable and a dime a dozen. If they become too cumbersome on high traffic blogs, the usual way is to parse webserver access logs on another server periodically and update the database. The period can vary from a few minutes to a day, depending on how much lag you deem acceptable.

There are two ways of going about this:
You could consider the individual page hits [through the Apache/IIS logs] and use that
Use Google Page rank to emphasize pages that are strongly linked to [popular posts would no longer be based on visits but on the amount of pages that link to it]

Related

What are best known algorithms/techniques for updating edges on huge graph structures like social networks?

On social networks like twitter where millions follow single account, it must be very challenging to update all followers instantly when a new tweet is posted. Similarly on facebook there are fan pages with millions of followers and we see updates from them instantly when posted on page. I am wondering what are best known techniques and algorithms to achieve this. I understand with billion accounts, they have huge data centers across globe and even if we reduce this problem for just one computer in following manner - 100,000 nodes with average 200 edges per node, then every single node update will require 200 edge updates. So what are best techniques/algorithms to optimize such large updates. Thanks!
The best way is usually just to do all the updates. You say they can be seen "instantly", but actually the updates probably propagate through the network and can take up to a few seconds to show up in followers' feeds.
Having to do all those updates may seem like a lot, but on average follower will check for updates much more often than a person being followed will produce them, and checking for updates has to be much faster.
The choices are:
Update 1 million followers, a couple times a day, within a few seconds; or
Respond to checks from 1 million followers, a couple hundred times a day, within 1/10 second or so.
There are in-between strategies involving clustering users and stuff, but usage patterns like you see on Facebook and Twitter are probably so heavily biased toward option (1) that such strategies don't pay off.

How to implement personalized feed ranking?

I have an app that aggregates various sports content (news articles, videos, discussions from users, tweets) and I'm currently working on having it so that it'll display relevant content to the users. Each post has a like button so I'm using that to determine what's popular. I'm using the reddit algorithm to have it sorted on popularity but also factor in time. However, my problem is that I want to make it more personalized for each user. Each user should see more content based on what they like. I have several factors I'm measuring:
- How many of each content they watch/click on? Ex: 60% videos and 40% articles
- What teams/players they like? If a news is about a team they like, it should be weighed more heavily
- What sport they like more? Users can follow several sports
What I'm currently doing:
For each of the factors listed above, I'll increase the popularity score by X of an article. Ex: user likes videos 70% than other content. I'll increase the score of videos by 70%.
I'm looking to see if there's better ways to do this? I've been told machine learning would be a good way but I wanted to see if there are any alternatives out there.
It sounds like what your doing is a great place to start with personalizing your users feeds.
Ranking based on popularity metrics (likes, comments, etc), recency, and in you case content type is the basis of the EdgeRank algorithm that Facebook used to use.
There are a lot of metrics that you can apply to try and boost engagement. Something
user liked post from team x, y times, so boost activity in feed by log(x) if post if is from y, boost activity if it’s newer, boost activity if it’s popular, etc… You can start to see that these EdgeRank algorithms can get a bit unwieldy rather quickly the more metrics you track. Also all the hyper-parameters that you set tend to be fixed for each user, which won’t end up with the ideal ranking algorithm for every user. Which is where machine learning techniques can come into play.
The main class of algorithms that deal with this sort of thing are often called Learning to Rank, and can be on a high level generalized into 3 categories. Collaborative filtering techniques, content based techniques, and hybrid techniques (blend of the first two)
In you case with a feed that most likely gets updated fairly frequently with new items, I would take a look at content based methods. Typically these algorithms are optimized around engagement metrics such as likelihood that the user is going to click, view, comment, or like an activity within their feed.
A little bit of self-promotion: I wrote a couple blog posts that cover some of this that you may find interesting.
https://getstream.io/blog/instagram-discovery-engine-tutorial/
https://getstream.io/blog/beyond-edgerank-personalized-news-feeds/
This can be a lot a lot to take on, so you could also take a look at using a 3rd party service like Stream (disclaimer, I do work there) who helps developers build scalable, personalized feeds.

Webpage updates detection algorithms

First of all, I'm not looking for code, just a plain discussion about approaches regarding what the subject says.
I was wondering lately on how really the best way to detect (as fast as possible) changes to website pages, assuming I have 100K websites, each has an unknown amount of pages, do a crawler really needs to visit each and every one of them once in a while?
Unless they have RSS feeds (which you would still need to pull to see if they have changed) there really isn't anyway to find out when the site has changed except by going to it and checking. However you can do some smart things to be more efficient. After you have been checking on the site for a while you can build a prediction model of when they tend to update. For example: this news site updates every 2-3 hours but that blog only makes about a post a week. This can save you many checks because the majority of pages don't actually update that often. Google does this to help with its pulling. One simple algorithm which will work for this (depending on how cutting edge you need your news to be) is the following of my own design based on binary search:
Start each site off with a time interval ~ 1 day
Visit the sites when that time hits and check changes
if something has changed
halve the time for that site
else
double the time for that site
If after many iterations you find it hovering around 2-3 numbers
fix the time on the greater of the numbers
Now this is a simple algorithm for finding which times are right for checking but you can probably do something more effective if you parse the text and see patterns in times when the updates were actually posted.

Is This Idea for Loading Online Content in Bulk Feasible?

I devised an idea a long time ago and never got around to implementing it, and I would like to know whether it is practical in that it would work to significantly decrease loading times for modern browsers. It relies on the fact that related tasks are often done more quickly when they are done together in bulk, and that the browser could be downloading content on different pages using a statistical model instead of being idle while the user is browsing. I've pasted below an excerpt from what I originally wrote, which describes the idea.
Description.
When people visit websites, I
conjecture that that a probability
density function P(q, t), where q is a
real-valued integer representing the
ID of a website and t is another
real-valued, non-negative integer
representing the time of the day, can
predict the sequence of webpages
visited by the typical human
accurately enough to warrant
requesting and loading the HTML
documents the user is going to read in
advance. For a given website, have the
document which appears to be the "main
page" of the website through which
users access the other sections be
represented by the root of a tree
structure. The probability that the
user will visit the root node of the
tree can be represented in two ways.
If the user wishes to allow a process
to automatically execute upon the
initialization of the operating system
to pre-fetch webpages from websites
(using a process elaborated later)
which the user frequently accesses
upon opening the web browser, the
probability function which determines
whether a given website will have its
webpages pre-fetched can be determined
using a self-adapting heuristic model
based on the user's history (or by
manual input). Otherwise, if no such
process is desired by the user, the
value of P for the root node is
irrelevant, since the pre-fetching
process is only used after the user
visits the main page of the website.
Children in the tree described earlier
are each associated with an individual
probability function P(q, t) (this
function can be a lookup table which
stores time-webpage pairs). Thus, the
sequences of webpages the user visits
over time are logged using this tree
structure. For instance, at 7:00 AM,
there may be a 71/80 chance that I
visit the "WTF" section on Reddit
after loading the main page of that
site. Based on the values of the
p> robability function P for each node
in the tree, chains of webpages
extending a certain depth from the
root node where the net probability
that each sequence is followed, P_c,
is past a certain threshold, P_min,
are requested upon the user visiting
the main page of the site. If the
downloading of one webpage finishes
before before another is processed, a
thread pool is used so that another
core is assigned the task of parsing
the next webpage in the queue of
webpages to be parsed. Hopefully, in
this manner, a large portion of those
webpages the user clicks may be
displayed much more quickly than they
would be otherwise.
I left out many details and optimizations since I just wanted this to be a brief description of what I was thinking about. Thank you very much for taking the time to read this post; feel free to ask any further questions if you have them.
Interesting idea -- and there have been some implementations for pre-fetching in browsers though without the brains you propose -- which could help alot. I think there are some flaws with this plan:
a) web page browsing, in most cases, is fast enough for most purposes.
b) bandwidth is becoming metered -- if I'm just glancing at the home page, do I as a user want to pay to serve the other pages. Moreover, in the cases where this sort of thing could be useful (eg--slow 3g connection), bandwidth tends to be more tightly metered. And perhaps not so good at concurrency (eg -- CDMA 3g connections).
c) from a server operator's point of view, I'd rather just serve requested pages in most cases. Rendering pages that don't ever get seen costs me cycles and bandwidth. If you are like alot of folks and on some cloud computing platform, you are paying by the cycle and the byte.
d) would require re-building lots of analytics systems, many of which still operate on the theory of request == impression
Or, the short summary is that there really isn't a need to pre-sage what people would view in order to speed serving and rendering pages. Now, places where something like this could be really useful would be in the "hey, if you liked X you probably liked Y" and then popping links and such to said content (or products) to folks.
Windows does the same thing with disk access - it "knows" that you are likely to start let's say Firefox at a certain time and preloads it.
SuperFetch also keeps track of what times of day those applications are used, which allows it to intelligently pre-load information that is expected to be used in the near future.
http://en.wikipedia.org/wiki/Windows_Vista_I/O_technologies#SuperFetch
Pointing out existing tech that does similar thing:
RSS readers load feeds in background, with assumption that user will want to read them sooner or later. There's no probability function that selects feeds for download though, user explicitly selects them
Browser start page and pinned tabs: these load as you start your browser, again user gets to select which websites are worth having around all the time
Your proposal boils down to predicting where user is most likely to click next given current website and current time of day. I can think of few other factors that play role here:
what other websites are open in tabs ("user opened song in youtube, preload lyrics and guitar chords!")
what other applications are running ("user is looking at invoice in e-mail reader, preload internet bank")
which person is using the computer--use webcam to recognize faces, know which sites each one frequents

How to continually filter interesting data to the user?

Take an example of a question/answer site with a 'browse' slideshow that will show one question/answer page at a time. The user clicks the 'next' button and a new question/answer is presented to him.
I need to decide which pages should be returned each time the user clicks 'next'. Some things I don't want and reasons why:
Showing 'newest' questions in descending order:
Say 100 questions get entered, then no user is going to click thru to the 100th item and it'll never get any responses. It also means if no new questions were asked recently, every time the user visits the site, he'll see the same repeated stale data.
Showing 'most active' questions, judged by a lot of suggested answers/comments:
This won't return those questions that have low activity, which are exactly the ones that need more visibility
Showing 'low activity' questions, judged by not a lot of answers/comments:
Once a question starts getting activity, it'll stop being shown. This will stymie the activity on a question, when I'd really like to encourage discussion.
I feel that a mix of these would work well, but I'm unsure of how to judge which pages should be returned. I'll stress that I don't want the user to have to choose which category of items to view (like how SO has the unanswered/active/newest filters).
Are there any common practices for doing this, or any ideas for how it might be done?
Thanks!
Edit:
Here's what I'm leaning towards so far, with much thanks to Tim's comment:
So far I'm thinking of ranking pages by Activity Count / View Count, where activity is incremented each time a user performs an action on a page, like a vote, comment, answer, etc. View will get incremented for each page every time a person views the page.
I'll then rank all pages by their activity/view ratio and show pages with a high ratio more often. This way pages with low activity and high views will be shown the least, while ones with high activity and low views will be shown most frequently. Low activity/low views and high activity/high views will be somewhere in the middle I imagine, but I'll have to keep a close eye on this in the beta release. I also plan on storing which pages the user has viewed in the past 24 hours so they won't see any repeats in the slideshow in a given day.
Some ideas for preventing 'stale' data (if all the above doesn't seem to prevent it): Perhaps run a cron job which will periodically check for pages that haven't been viewed recently and boost their ratio to put them at the top.
As I see it, you are touching upon two interesting questions:
How to define that a post is interesting to a user: Here you could take a weighted combination of various factors that could contribute to interestingness of a post. Amount of activity, how fresh the entry is, if you have a way of knowing that the item matches users interest etc etc. You could pick the weights based on intuition and see how well the result matches your expectation. If you have the time and inclination, you could collect data on how well your users respond to the entries and try to learn the optimum weights for each factor using machine learning techniques.
How to give new posts a chance, otherwise known as exploration-exploitation tradeoff.
BAsically, if you just keep going to known interesting entries then you will maximize instantaneous user happiness, but you will never learn about new interesting stuff hence, overall your users are unhappy.
This is a very well studies problem, and depending upon how much you want to get into it, you can read up literature on things like k-armed bandit problems.
But a simple solution would be to not pick the entry with the highest score, but pick the entry based on a probability distribution such that high score entries have higher probability of showing up. This way most of the times you show interesting stuff, but every post has a chance to show up occasionally.

Resources