Algorithms/Techniques for rating websites (PageRank aside) - algorithm

I'm looking for algorithms/techniques that are able to express the importance of a single webpage. Leaving PageRank aside, are there any other methods of doing such a rating based on content, structure, and the hyperlinks between pages?
I'm not only talking about the connection from www.foo.com to www.bar.com, as PageRank does, but also from www.foo.com/bar to www.foo.com/baz and so on (besides the option of adapting PageRank for these needs).
How do I "define" importance: I think of importance in this context as "how relevant is this page to the user, and how important is it to the rest of the site".
E.g. a Christmas raffle announced on the start page, with only a single link leading to it, is important both to the user and to the site. An imprint, which has a link from every page (since it's usually somewhere in the footer), is not important even though many links point to it. The imprint is also not important to the site as a "unit", since it doesn't contribute any real value to the site's purpose (giving information, selling products, a general service, etc.).

There is also SALSA, which is more stable than HITS [so it suffers less from spam].
Since you are also interested in the context of pages, you might want to have a look at Haveliwala's work on topic-sensitive PageRank.

Another famous algorithm is Hubs and Authorities (HITS). Basically, you score each page both as a hub (a page having a lot of outbound links) and as an authority (a page having a lot of inbound links).
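A minimal sketch of the HITS iteration, assuming the link graph is given as a dict mapping each page to the pages it links to (the page names here are made up):

```python
# Minimal HITS sketch: alternately update hub and authority scores
# for a small link graph given as {page: [pages it links to]}.
def hits(links, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to this page
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub score: sum of authority scores of pages this page links to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"/": ["/raffle", "/imprint"], "/products": ["/imprint"]})
print(auth)
```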
But you should really define what you mean by importance. What does "important" really mean? PageRank defines it with respect to inbound links; that is PageRank's definition.
If you define "important" as having photos, because you like photography, then you could come up with an importance metric like the number of photos on the page. Another metric could be the number of inbound links from photography sites (like flickr.com, 500px, ...).
Using your definition of importance, you could use `1 - (number of inbound links / number of pages on the site)`. This gives you a number between 0 and 1, where 0 means not important and 1 means important.
Using this metric, your imprint, which appears on all the pages of the site, has an importance of 0. Your Christmas sale page, which has only one link to it, has an importance of almost 1.
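A rough sketch of that metric (the page counts below are made up for illustration):

```python
# Sketch of the 1 - (inbound links / pages on the site) metric.
def importance(inbound_links, total_pages):
    return 1 - inbound_links / total_pages

total_pages = 200                    # hypothetical site size
print(importance(200, total_pages))  # imprint linked from every page -> 0.0
print(importance(1, total_pages))    # raffle page with a single link -> 0.995
```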

Related

How do search engines evaluate the usefulness of their results?

A search engine returns 2 results A and B for search "play game XYZ".
People who click on result B spend a much longer time and play a lot more XYZ games at site B, while clickers to site A leave the site after a short visit.
I'd imagine site B is a better search result than site A even though it's possible that site B is new and has few links to it in comparison with A. I assume better search engines would take this into account.
My question is, if so, how do they keep track of the usage patterns of a search result in the real world?
There are two issues here:
If a user plays game B a lot, he is likely to write about it and link to it (blogs, reviews, social networks, ...). If he does, the static score of B will rise. This is part of the PageRank algorithm, which gives the static score of each page and helps the search engine decide which page is better.
There is another factor that some search engines use: if a user clicked a page but then searched the same/similar query very soon afterwards, it is likely he did not find what he was after. In this case, the search engine can assume the page is not a good fit to the query and reduce the score given to this page.
Other than that, the search engine cannot really know "how much time you played a game" (unless you revisit it multiple times by re-issuing the query rather than navigating to it directly, in which case it can use the number of times you reached the game through search).
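A toy illustration of that second signal; the time window and penalty factor are invented for the example, not anything a real search engine is known to use:

```python
# Toy sketch: demote a clicked result if the user re-issues a similar
# query shortly afterwards, suggesting the page was not a good fit.
REQUERY_WINDOW = 30.0  # seconds; arbitrary threshold for this example

def adjust_score(score, click_time, next_query_time, penalty=0.8):
    if next_query_time is not None and next_query_time - click_time < REQUERY_WINDOW:
        return score * penalty  # user came straight back: likely a poor fit
    return score

print(adjust_score(1.0, click_time=100.0, next_query_time=110.0))  # demoted to 0.8
print(adjust_score(1.0, click_time=100.0, next_query_time=None))   # unchanged
```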
Search engines get results with algorithms like PageRank, which sort the websites in their database by how many sites link to them.
From Wikipedia: PageRank is an algorithm "that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of 'measuring' its relative importance within the set."
As more sites link to it, its reputation is assumed to rise, and thus its ranking.
Other methods can be used as well: search engines can also detect how much time is spent on a website through their external services. For example, Google tracks time spent through its widely used tracking/web-stats service, Google Analytics.
As the other answers mention, another method of detecting a site's relevance to the query is checking whether a similar search is conducted within a short time frame of the previous one. This can indicate whether the user actually found what they were looking for on the previous site.

Is This Idea for Loading Online Content in Bulk Feasible?

I devised an idea a long time ago and never got around to implementing it, and I would like to know whether it is practical in that it would work to significantly decrease loading times for modern browsers. It relies on the fact that related tasks are often done more quickly when they are done together in bulk, and that the browser could be downloading content on different pages using a statistical model instead of being idle while the user is browsing. I've pasted below an excerpt from what I originally wrote, which describes the idea.
Description.
When people visit websites, I conjecture that a probability density function P(q, t), where q is an integer representing the ID of a website and t is a non-negative integer representing the time of day, can predict the sequence of webpages visited by the typical human accurately enough to warrant requesting and loading the HTML documents the user is going to read in advance. For a given website, let the document which appears to be the "main page" of the website, through which users access the other sections, be represented by the root of a tree structure. The probability that the user will visit the root node of the tree can be represented in two ways. If the user wishes to allow a process that automatically executes upon initialization of the operating system to pre-fetch webpages from websites (using a process elaborated later) which the user frequently accesses upon opening the web browser, the probability function which determines whether a given website will have its webpages pre-fetched can be determined using a self-adapting heuristic model based on the user's history (or by manual input). Otherwise, if no such process is desired by the user, the value of P for the root node is irrelevant, since the pre-fetching process is only used after the user visits the main page of the website.
Children in the tree described earlier are each associated with an individual probability function P(q, t) (this function can be a lookup table which stores time-webpage pairs). Thus, the sequences of webpages the user visits over time are logged using this tree structure. For instance, at 7:00 AM, there may be a 71/80 chance that I visit the "WTF" section on Reddit after loading the main page of that site. Based on the values of the probability function P for each node in the tree, chains of webpages extending a certain depth from the root node, where the net probability that each sequence is followed, P_c, is past a certain threshold, P_min, are requested upon the user visiting the main page of the site. If the downloading of one webpage finishes before another is processed, a thread pool is used so that another core is assigned the task of parsing the next webpage in the queue of webpages to be parsed. Hopefully, in this manner, a large portion of the webpages the user clicks may be displayed much more quickly than they would be otherwise.
I left out many details and optimizations since I just wanted this to be a brief description of what I was thinking about. Thank you very much for taking the time to read this post; feel free to ask any further questions if you have them.
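For what it's worth, here is a very small sketch of the chain-selection step described above. The lookup-table form of P and the P_min threshold come from the description; the function name, the flat per-page table (ignoring time of day), and the example probabilities are illustrative only:

```python
# Sketch: given per-link click probabilities learned from history, collect
# every chain of pages from the current page whose cumulative probability
# stays above P_min, up to a fixed depth, and prefetch those URLs.
def chains_to_prefetch(p, page, p_min=0.5, depth=2, p_chain=1.0):
    urls = []
    if depth == 0:
        return urls
    for child, prob in p.get(page, {}).items():
        p_next = p_chain * prob
        if p_next >= p_min:
            urls.append(child)
            urls.extend(chains_to_prefetch(p, child, p_min, depth - 1, p_next))
    return urls

# P as a lookup table: probability of each next click given the current page
# (in the full idea this would also be keyed by time of day).
P = {"reddit.com": {"reddit.com/r/wtf": 71 / 80, "reddit.com/r/pics": 0.05}}
print(chains_to_prefetch(P, "reddit.com"))  # -> ['reddit.com/r/wtf']
```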
Interesting idea -- and there have been some implementations of pre-fetching in browsers, though without the brains you propose -- which could help a lot. I think there are some flaws with this plan:
a) web page browsing, in most cases, is fast enough for most purposes.
b) bandwidth is becoming metered -- if I'm just glancing at the home page, do I as a user want to pay to serve the other pages? Moreover, in the cases where this sort of thing could be useful (e.g. a slow 3G connection), bandwidth tends to be more tightly metered, and perhaps not so good at concurrency (e.g. CDMA 3G connections).
c) from a server operator's point of view, I'd rather just serve requested pages in most cases. Rendering pages that never get seen costs me cycles and bandwidth. If you are like a lot of folks and on some cloud computing platform, you are paying by the cycle and the byte.
d) it would require re-building lots of analytics systems, many of which still operate on the theory of request == impression.
Or, the short summary is that there really isn't a need to presage what people would view in order to speed up serving and rendering pages. Now, a place where something like this could be really useful would be the "hey, if you liked X you probably liked Y" scenario, popping up links to said content (or products) for folks.
Windows does the same thing with disk access - it "knows" that you are likely to start let's say Firefox at a certain time and preloads it.
SuperFetch also keeps track of what times of day those applications are used, which allows it to intelligently pre-load information that is expected to be used in the near future.
http://en.wikipedia.org/wiki/Windows_Vista_I/O_technologies#SuperFetch
Pointing out existing tech that does similar things:
RSS readers load feeds in the background, on the assumption that the user will want to read them sooner or later. There's no probability function that selects feeds for download, though; the user explicitly selects them.
Browser start pages and pinned tabs: these load as you start your browser; again, the user gets to select which websites are worth having around all the time.
Your proposal boils down to predicting where the user is most likely to click next, given the current website and current time of day. I can think of a few other factors that play a role here:
what other websites are open in tabs ("user opened song in youtube, preload lyrics and guitar chords!")
what other applications are running ("user is looking at invoice in e-mail reader, preload internet bank")
which person is using the computer--use webcam to recognize faces, know which sites each one frequents

How does the Amazon Recommendation feature work?

What technology goes on behind the scenes of Amazon's recommendation technology? I believe that Amazon's recommendations are currently the best in the market, but how do they provide us with such relevant recommendations?
Recently, we have been involved in a similar recommendation project, but we would surely like to know the ins and outs of the Amazon recommendation technology from a technical standpoint.
Any inputs would be highly appreciated.
Update:
This patent explains how personalized recommendations are done but it is not very technical, and so it would be really nice if some insights could be provided.
From Dave's comments, affinity analysis forms the basis for this kind of recommendation engine. Here are also some good reads on the topic:
Demystifying Market Basket Analysis
Market Basket Analysis
Affinity Analysis
Suggested Reading:
Data Mining: Concepts and Techniques
It is both an art and a science. Typical fields of study revolve around market basket analysis (also called affinity analysis) which is a subset of the field of data mining. Typical components in such a system include identification of primary driver items and the identification of affinity items (accessory upsell, cross sell).
Keep in mind the data sources they have to mine...
Purchased shopping carts = real money from real people spent on real items = powerful data and a lot of it.
Items added to carts but abandoned.
Pricing experiments online (A/B testing, etc.) where they offer the same products at different prices and see the results
Packaging experiments (A/B testing, etc.) where they offer different products in different "bundles" or discount various pairings of items
Wishlists - what's on them specifically for you - and in aggregate it can be treated similarly to another stream of basket analysis data
Referral sites (identification of where you came in from can hint at other items of interest)
Dwell times (how long before you click back and pick a different item)
Ratings by you or those in your social network/buying circles - if you rate things you like you get more of what you like and if you confirm with the "i already own it" button they create a very complete profile of you
Demographic information (your shipping address, etc.) - they know what is popular in your general area for your kids, yourself, your spouse, etc.
User segmentation = did you buy 3 books in separate months for a toddler? You likely have a kid or more... etc.
Direct marketing click through data - did you get an email from them and click through? They know which email it was and what you clicked through on and whether you bought it as a result.
Click paths in session - what did you view regardless of whether it went in your cart
Number of times viewed an item before final purchase
If you're dealing with a brick and mortar store they might have your physical purchase history to go off of as well (i.e. toys r us or something that is online and also a physical store)
etc. etc. etc.
Luckily, people behave similarly in aggregate, so the more they know about the buying population at large, the better they know what will and won't sell; and with every transaction and every rating/wishlist add/browse, they learn how to tailor recommendations more personally. Keep in mind this is likely only a small sample of the full set of influences on what ends up in recommendations.
Now, I have no inside knowledge of how Amazon does business (never worked there), and all I'm doing is talking about classical approaches to the problem of online commerce - I used to be the PM who worked on data mining and analytics for the Microsoft product called Commerce Server. We shipped in Commerce Server the tools that allowed people to build sites with similar capabilities... but the bigger the sales volume, the better the data, and the better the model - and Amazon is BIG. I can only imagine how much fun it is to play with models with that much data on a commerce-driven site. Now many of those algorithms (like the predictor that started out in Commerce Server) have moved on to live directly within Microsoft SQL.
The four big takeaways you should have are:
Amazon (or any retailer) is looking at aggregate data for tons of transactions and tons of people... this allows them to even recommend pretty well for anonymous users on their site.
Amazon (or any sophisticated retailer) is keeping track of behavior and purchases of anyone that is logged in and using that to further refine on top of the mass aggregate data.
Often there is a means of overriding the accumulated data and giving "editorial" control of suggestions to product managers of specific lines (like someone who owns the 'digital cameras' vertical or the 'romance novels' vertical or similar) where they truly are experts.
There are often promotional deals (i.e. Sony, Panasonic, Nikon, Canon, Sprint, or Verizon pays additional money to the retailer, or gives a better discount at larger quantities, or other things in those lines) that will cause certain "suggestions" to rise to the top more often than others - there is always some reasonable business logic and business reason behind this, targeted at making more on each transaction or reducing wholesale costs, etc.
In terms of actual implementation? Just about all large online systems boil down to some set of pipelines (or a filter-pattern implementation, or a workflow, etc. - call it what you will) that allow a context to be evaluated by a series of modules that apply some form of business logic.
Typically a different pipeline would be associated with each separate task on the page - you might have one that does recommended "packages/upsells" (i.e. buy this with the item you're looking at) and one that does "alternatives" (i.e. buy this instead of the thing you're looking at) and another that pulls items most closely related from your wish list (by product category or similar).
The results of these pipelines can be placed on various parts of the page (above the scroll, below the scroll, on the left, on the right, with different fonts, different image sizes, etc.) and tested to see which perform best. Since you're using nice, easy-to-plug-in modules that define the business logic for these pipelines, you end up with the moral equivalent of Lego blocks that make it easy to pick and choose the business logic you want applied when you build another pipeline, which allows faster innovation, more experimentation, and, in the end, higher profits.
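As a bare-bones illustration of that pipeline/module idea (the module names, items, and scores are all invented):

```python
# Bare-bones pipeline sketch: a context dict flows through pluggable modules,
# each applying one piece of business logic to the candidate recommendations.
def cross_sell(ctx):
    ctx["candidates"].append(("tripod", 0.4))  # stand-in for an affinity lookup
    return ctx

def promo_boost(ctx):
    # boost items that are part of a promotional deal
    ctx["candidates"] = [(item, score * 1.2 if item in ctx["promos"] else score)
                         for item, score in ctx["candidates"]]
    return ctx

def run_pipeline(ctx, modules):
    for module in modules:
        ctx = module(ctx)
    return sorted(ctx["candidates"], key=lambda pair: -pair[1])

ctx = {"viewing": "camera", "candidates": [("memory card", 0.6)], "promos": {"tripod"}}
print(run_pipeline(ctx, [cross_sell, promo_boost]))
```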
Did that help at all? I hope that gives you a little insight into how this works in general for just about any e-commerce site - not just Amazon. Amazon (from talking to friends who have worked there) is very data-driven and continually measures the effectiveness of its user experience and its pricing, promotion, packaging, etc. - it is a very sophisticated online retailer and is likely at the leading edge of a lot of the algorithms used to optimize profit - and those are likely proprietary secrets (you know, like the formula for KFC's secret spices) and guarded as such.
This isn't directly related to Amazon's recommendation system, but it might be helpful to study the methods used by people who competed in the Netflix Prize, a contest to develop a better recommendation system using Netflix user data. A lot of good information exists in their community about data mining techniques in general.
The team that won used a blend of the recommendations generated by a lot of different models/techniques. I know that some of the main methods used were principal component analysis, nearest neighbor methods, and neural networks. Here are some papers by the winning team:
R. Bell, Y. Koren, C. Volinsky, "The BellKor 2008 Solution to the Netflix Prize", (2008).
A. Töscher, M. Jahrer, "The BigChaos Solution to the Netflix Prize 2008", (2008).
A. Töscher, M. Jahrer, R. Legenstein, "Improved Neighborhood-Based Algorithms for Large-Scale Recommender Systems", SIGKDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition (KDD'08), ACM Press (2008).
Y. Koren, "The BellKor Solution to the Netflix Grand Prize", (2009).
A. Töscher, M. Jahrer, R. Bell, "The BigChaos Solution to the Netflix Grand Prize", (2009).
M. Piotte, M. Chabbert, "The Pragmatic Theory solution to the Netflix Grand Prize", (2009).
The 2008 papers are from the first year's Progress Prize. I recommend reading the earlier ones first because the later ones build upon the previous work.
I bumped into this paper today:
Amazon.com Recommendations: Item-to-Item Collaborative Filtering
Maybe it provides additional information.
(Disclaimer: I used to work at Amazon, though I didn't work on the recommendations team.)
ewernli's answer should be the correct one -- the paper links to Amazon's original recommendation system, and from what I can tell (both from personal experience as an Amazon shopper and having worked on similar systems at other companies), very little has changed: at its core, Amazon's recommendation feature is still very heavily based on item-to-item collaborative filtering.
Just look at what form the recommendations take: on my front page, they're all either of the form "You viewed X...Customers who also viewed this also viewed...", or else a melange of items similar to things I've bought or viewed before. If I specifically go to my "Recommended for You" page, every item describes why it's recommended for me: "Recommended because you purchased...", "Recommended because you added X to your wishlist...", etc. This is a classic sign of item-to-item collaborative filtering.
So how does item-to-item collaborative filtering work? Basically, for each item, you build a "neighborhood" of related items (e.g., by looking at what items people have viewed together or what items people have bought together -- to determine similarity, you can use metrics like the Jaccard index; correlation is another possibility, though I suspect Amazon doesn't use ratings data very heavily). Then, whenever I view an item X or make a purchase Y, Amazon suggests things in the same neighborhood as X or Y.
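A minimal sketch of that neighborhood-building step using the Jaccard index over the sets of users who interacted with each item (the items and users are made up):

```python
# Item-to-item collaborative filtering sketch: build each item's "neighborhood"
# by Jaccard similarity of the sets of users who viewed/bought it.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def neighborhoods(item_users, k=2):
    return {i: sorted(((jaccard(users, item_users[j]), j)
                       for j in item_users if j != i), reverse=True)[:k]
            for i, users in item_users.items()}

item_users = {                    # which users interacted with which item
    "book_A": {"u1", "u2", "u3"},
    "book_B": {"u2", "u3"},
    "dvd_C":  {"u3", "u4"},
}
# Recommend items from the neighborhood of whatever the user just viewed/bought.
print(neighborhoods(item_users)["book_A"])
```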
Some other approaches that Amazon could potentially use, but likely doesn't, are described here: http://blog.echen.me/2011/02/15/an-overview-of-item-to-item-collaborative-filtering-with-amazons-recommendation-system/
A lot of what Dave describes is almost certainly not done at Amazon. (Ratings by those in my social network? Nope, Amazon doesn't have any of my social data. This would be a massive privacy issue in any case, so it'd be tricky for Amazon to do even if they had that data: people don't want their friends to know what books or movies they're buying. Demographic information? Nope, nothing in the recommendations suggests they're looking at this. [Unlike Netflix, who does surface what other people in my area are watching.])
I don't have any knowledge of Amazon's algorithm specifically, but one component of such an algorithm would probably involve tracking groups of items frequently ordered together, and then using that data to recommend other items in the group when a customer purchases some subset of the group.
Another possibility would be to track the frequency of item B being ordered within N days after ordering item A, which could suggest a correlation.
As far as I know, it uses case-based reasoning as its engine.
You can look at these sources: here, here, and here.
There are many sources on Google when searching for "amazon" and "case-based reasoning".
If you want a hands-on tutorial (using open-source R) then you could do worse than going through this:
https://gist.github.com/yoshiki146/31d4a46c3d8e906c3cd24f425568d34e
It is a run-time optimised version of another piece of work:
http://www.salemmarafi.com/code/collaborative-filtering-r/
However, the variant of the code in the first link runs MUCH faster, so I recommend using that (I found the only slow part of yoshiki146's code to be the final routine which generates the recommendations at user level - it took about an hour with my data on my machine).
I adapted this code to work as a recommendation engine for the retailer I work for.
The algorithm used is - as others have said above - collaborative filtering. This method of CF calculates a cosine similarity matrix and then sorts by that similarity to find the 'nearest neighbour' for each element (a music band in the example given, a retail product in my application).
The resulting table can recommend a band/product based on another chosen band/product.
The next section of the code goes a step further with USER (or customer) based collaborative filtering.
The output of this is a large table with the top 100 bands/products recommended for a given user/customer.
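As an illustration of the item-based step that workflow describes - this is not the code from those links, just a tiny sketch with a made-up purchase matrix:

```python
import numpy as np

# Item-based CF sketch: cosine similarity between the item columns of a
# user x item purchase matrix, then the nearest neighbour for each item.
R = np.array([[1, 1, 0],   # rows: customers, columns: products
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=float)

norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)  # item-item cosine similarity matrix
np.fill_diagonal(sim, 0)

items = ["prod_A", "prod_B", "prod_C"]
for i, name in enumerate(items):
    print(name, "->", items[int(sim[i].argmax())])  # nearest neighbour per item
```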
Someone gave a presentation at our university on something similar last week and referenced the Amazon recommendation system. I believe that it uses a form of k-means clustering to cluster people by their different buying habits. Hope this helps :)
Check this out too: Link and as HTML.

What's the best way to classify a list of sites?

I have a list of X sites that I need to classify in some way. Is the site about cars, health, or products, or is it about everything (wikiHow, about.com, etc.)? What are some of the better ways to classify sites like this? Should I get the keywords that bring traffic to the site and use those? Should I read the content of some random pages and judge it off of that?
Well, if the site is well designed, there will be meta tags in the header specifically for this.
Yahoo has an API to extract terms: http://developer.yahoo.com/search/content/V2/termExtraction.html
"The Term Extraction Web Service provides a list of significant words or phrases extracted from a larger content. It is one of the technologies used in Y!Q."
Maybe I'm a bit biased (disclaimer: I have a degree in library science, and this topic is one of the reasons I got the degree), so the easiest answer is that there is no best way.
Consider this like you would database design -- once you have your system populated, what sort of questions are you going to ask of it?
Is the fact that the site is run by the government significant? Or that it uses Flash? Or that the pages are blue? Or that it's a hobbyist site? Or that the intended audience is children?
Then we get the question of if we're going to have a hierarchical category for any of the facets we're concerned with -- if it's about both cars and motorcycles, should we use the term 'vehicles' instead? And if we do that, will we use keyword expansion so that 'motorcycle' matches the broader terms (ie, vehicles) as well?
So ... the point is ... figure out what your needs are, and work towards that. 'Best' will never come, even with years of refinement (if anything, it gets more difficult, as terms start changing meanings. Remember when 'weblog' was related to web server metrics?)
This is a tough question to answer. Consider:
How granular do you want your classification to be?
Do you want to classify sites based on your own criteria or the criteria provided by the sites? In other words, if a site classifies itself as "a premier source for motorcycle maintenance", do you want to create a "motorcycle maintenance" category just for that site? This, of course, will cause your list to become inconsistent. However, if you pigeon-hole the sites to follow your own classification scheme, there is a loss of information, and a risk that the site will not match any of the categories you've defined.
Do you allow subcategories? The problem becomes much more complicated if so.
Can a site belong to more than one category? If so, is there an ordering or a weight (ie. Primary Category, Secondary Categories, etc.), or do you follow a scheme similar to SO's tags?
As an initial stab at the problem, I think I'd define a set of categories, and then spider each site, keeping track of the number of occurrences of each category name, or a mutation thereof. Then, you can choose the name that had the greatest number of "hits."
For instance, given the following categories:
{ "Cars", "Motorcycles", "Video Games" }
Spidering the following blocks of text from a site:
The title is an incongruous play on the title of the book Zen in the Art of Archery by Eugen Herrigel. In its introduction, Pirsig explains that, despite its title, "it should in no way be associated with that great body of factual information relating to orthodox Zen Buddhist practice. It's not very factual on motorcycles, either."
and:
Most motorcycles made since 1980 are pretty reliable if properly maintained but that's a big if. To some extent the high reliability of today's motorcycles has worked to the disadvantage of many riders. Some riders have been lulled into believing that motorcycles are like modern cars and require essentially no maintenance. This is not the case (even with cars). Modern bikes require less maintenance than they did in the 60's and 70's but they still need a lot more maintence than a car. This higher reliability also means that there are a a whole bunch of motorcyclists out there who haven't a clue how to work on their bikes or what really needs to be done to ensure reliability.
We get the following scores:
{ "Cars" : 3, "Motorcycles" : 4, "Video Games" : 0 }
And we can thus categorize the site as being related mostly to "Motorcycles".
Note that I said "mutations thereof" with regard to category names, so "motorcycle" and "car" are both detected. We can see from this that you should perhaps also consider using a list of related words. For instance, perhaps we should detect the word "motorcyclists" when searching for instances of "Motorcycles". Perhaps we should've counted "modern bikes", too.
You could also save those hits, perhaps combine them with some other data, and use Bayesian probability to determine which category the site is most likely to fit into.
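A compact sketch of that counting approach; the per-category term lists (the "mutations" and related words) are illustrative only:

```python
import re

# Count occurrences of each category's terms in a page's text and pick the
# category with the most hits.
CATEGORIES = {
    "Cars":        ["car", "cars"],
    "Motorcycles": ["motorcycle", "motorcycles", "motorcyclists", "bike", "bikes"],
    "Video Games": ["video game", "video games"],
}

def classify(text):
    text = text.lower()
    scores = {cat: sum(len(re.findall(r"\b%s\b" % re.escape(term), text))
                       for term in terms)
              for cat, terms in CATEGORIES.items()}
    return max(scores, key=scores.get), scores

print(classify("Most motorcycles made since 1980 are pretty reliable ... "
               "motorcycles are like modern cars ..."))
```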

Google PageRank: Does it count per domain or per webpage

Is the Google PageRank calculated as one value for a whole website (domain) or is it computed for every single webpage?
How closely Google follows the publicly known PageRank algorithm is their trade secret. In the generic algorithm, PageRank is calculated per document.
Edit: the original, generic PageRank is explained at http://www.ams.org/featurecolumn/archive/pagerank.html
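For reference, a minimal power-iteration sketch of the generic, per-document PageRank described at that link (the damping factor 0.85 is the usual textbook value; the link graph is made up):

```python
# Generic PageRank sketch: power iteration over a small link graph,
# producing one score per document (not per domain).
def pagerank(links, d=0.85, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p, targets in links.items():
            if targets:
                share = pr[p] / len(targets)
                for t in targets:
                    new[t] += d * share
            else:                          # dangling page: spread its rank evenly
                for t in pages:
                    new[t] += d * pr[p] / n
        pr = new
    return pr

print(pagerank({"a.html": ["b.html"], "b.html": ["a.html", "c.html"], "c.html": []}))
```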
Google PageRank
Here is a snippet and the link to an explanation from Google themselves:
http://www.google.com/corporate/tech.html
PageRank Technology: PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.
PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page's importance.
per webpage.
It should be per web page.

Resources