Determining an a priori ranking of what sites a user has most likely visited

This is for http://cssfingerprint.com
I have a largish database (~100M rows) of websites. This includes both main domains (both 2LD and 3LD) and particular URLs scraped from those domains (whether hosted there [like most blogs] or only linked from it [like Digg], and with a reference to the host domain).
I also scrape the Alexa top million, Bloglines top 1000, Google PageRank, Technorati top 100, and Quantcast top million rankings. Many domains have no ranking at all, though, or only a partial set; and nearly all sub-domain URLs have no ranking other than Google's 0-10 PageRank (some don't even have that).
I can add any new scrapings necessary, assuming it doesn't require a massive amount of spidering.
I also have a fair amount of information about what sites previous users have visited.
What I need is an algorithm that orders these URLs by how likely a visitor is to have visited that URL without any knowledge of the current visitor. (It can, however, use aggregated information about previous users.)
This question is just about the relatively fixed (or at least aggregated) a priori ranking; there's another question that deals with getting a dynamic ranking.
Given that I have limited resources (both computational and financial), what's the best way for me to rank these sites in order of a priori probability of their having been visited?
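In case it helps to have something concrete, here is a rough sketch of the kind of baseline I have in mind (my own assumption, not an established method): treat each available ranking as a noisy popularity estimate, convert ranks to Zipf-style scores, and combine whatever signals a URL has, falling back on its host domain's signals. All field names and weights below are made up for illustration.

    # Sketch: combine partial popularity signals (Alexa/Quantcast rank, toolbar
    # PageRank, visit counts from previous users) into one a priori score.
    # All field names and weights here are illustrative assumptions, not a spec.
    import math

    def rank_score(rank):
        # Zipf-style score for a "top N" rank; unranked (None/0) contributes nothing
        return 1.0 / rank if rank else 0.0

    def prior_score(url_row, domain_row, total_prev_visitors):
        """url_row / domain_row: dicts of whatever signals exist; keys may be missing."""
        signals = []
        for row, weight in ((url_row, 1.0), (domain_row, 0.5)):
            if not row:
                continue
            if row.get("alexa_rank"):
                signals.append(weight * rank_score(row["alexa_rank"]))
            if row.get("quantcast_rank"):
                signals.append(weight * rank_score(row["quantcast_rank"]))
            if row.get("pagerank") is not None:
                # toolbar PageRank is roughly logarithmic, so map 0-10 to 1e-10 .. 1
                signals.append(weight * 10 ** (row["pagerank"] - 10))
            if row.get("prev_visits") and total_prev_visitors:
                # smoothed fraction of previous users who visited this URL/domain
                signals.append(weight * (row["prev_visits"] + 1) / (total_prev_visitors + 2))
        if not signals:
            return 0.0
        # geometric mean so no single signal completely swamps the rest
        return math.exp(sum(math.log(s) for s in signals) / len(signals))

Sorting every row by this score descending would give a crude a priori ordering; URLs with no signals of their own inherit whatever their host domain has, and anything still at zero goes to the bottom.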

Related

Detecting duplicate webpages among large number of URLs

From a quote on the official Google blog (googleblog.blogspot.com):
"In fact, we found even more than 1 trillion individual links, but not all of
them lead to unique web pages. Many pages have multiple URLs with exactly the same
content or URLs that are auto-generated copies of each other. Even after removing
those exact duplicates . . . "
How does Google detect those exact duplicate webpages or documents? Any idea what algorithm Google uses?
According to http://en.wikipedia.org/wiki/MinHash:
A large scale evaluation has been conducted by Google in 2006 [10] to
compare the performance of Minhash and Simhash[11] algorithms. In 2007
Google reported using Simhash for duplicate detection for web
crawling[12] and using Minhash and LSH for Google News
personalization.[13]
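For context, here is a minimal sketch of the MinHash idea the article describes (a textbook version, not Google's implementation): shingle each document, keep the minimum hash value under many seeded hash functions, and estimate Jaccard similarity as the fraction of matching minima.

    # Minimal MinHash: estimates the Jaccard similarity of two documents' shingle sets.
    import hashlib

    def shingles(text, k=5):
        # k-word shingles; a very short document just becomes one shingle
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

    def minhash_signature(shingle_set, num_hashes=128):
        # one "hash function" per seed: md5 over the seed-prefixed shingle
        return [min(int(hashlib.md5((str(seed) + ":" + s).encode("utf-8")).hexdigest(), 16)
                    for s in shingle_set)
                for seed in range(num_hashes)]

    def estimated_jaccard(sig_a, sig_b):
        # the fraction of positions whose minima agree estimates |A ∩ B| / |A ∪ B|
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

Near-duplicate pages agree on most of the 128 minima; LSH then buckets signatures so that only candidates which collide in some band are compared in full.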
A search for Simhash turns up these pages:
https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/
https://github.com/leonsim/simhash
which reference a paper written by Google employees: "Detecting Near-Duplicates for Web Crawling".
Abstract:
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
Another Simhash paper:
http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
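The fingerprinting part of Charikar's technique is short enough to sketch: hash every feature to f bits, add the feature's weight at bit positions that are 1 and subtract it at positions that are 0, and keep the sign vector as the fingerprint; two pages are near-duplicates if their fingerprints differ in at most k bits. A simplified, unoptimized sketch follows (the paper's real contribution, finding all stored fingerprints within k bits without a full scan, is not shown here).

    # Simplified 64-bit simhash plus a brute-force Hamming-distance check.
    import hashlib

    def simhash(features, f=64):
        """features: dict mapping a feature (e.g. a token) to its weight (e.g. its count)."""
        v = [0] * f
        for feat, weight in features.items():
            h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16)
            for i in range(f):
                v[i] += weight if (h >> i) & 1 else -weight
        fingerprint = 0
        for i in range(f):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    def hamming_distance(a, b):
        return bin(a ^ b).count("1")

    def near_duplicate(fp_a, fp_b, k=3):
        return hamming_distance(fp_a, fp_b) <= k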
possible solutions
exact methods
1) brute force: compare every new page to all previously visited pages (very slow and inefficient)
2) calculate a hash of every visited page (MD5, SHA-1), store the hashes in a database, and look up every new page's hash in the database (see the sketch after this list)
3) standard Boolean model of information retrieval (BIR)
... many other possible methods
near-exact methods
1) fuzzy hashing
2) latent semantic indexing
...
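A sketch of option 2 from the exact methods above: normalize the page body, hash it, and look the digest up before inserting. The SQLite table and its columns are hypothetical, just to show the lookup.

    # Option 2 sketched: store one SHA-1 digest per page and check new pages against it.
    # The "pages" table and its columns are hypothetical.
    import hashlib
    import re
    import sqlite3

    def content_digest(html):
        # crude normalization so trivial whitespace/case changes don't defeat the hash
        text = re.sub(r"\s+", " ", html).strip().lower()
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    conn = sqlite3.connect("pages.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, digest TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_digest ON pages(digest)")

    def record_and_check(url, html):
        """Returns the URL of an already-stored exact duplicate, or None."""
        d = content_digest(html)
        dup = conn.execute("SELECT url FROM pages WHERE digest = ?", (d,)).fetchone()
        conn.execute("INSERT INTO pages VALUES (?, ?)", (url, d))
        conn.commit()
        return dup[0] if dup else None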

How do search engines evaluate the usefulness of their results?

A search engine returns 2 results A and B for search "play game XYZ".
People who click on result B spend much longer there and play a lot more XYZ games at site B, while those who click on result A leave the site after a short visit.
I'd imagine site B is a better search result than site A even though it's possible that site B is new and has few links to it in comparison with A. I assume better search engines would take this into account.
My question is, if so, how do they keep track of usage pattern of a search result in the real world?
There are two issues here:
If a user plays game B a lot, he is likely to write about it and link to it (blogs, reviews, social networks, ...). If he does, the static score of B will rise. This is part of the PageRank algorithm, which gives the static score of each page and helps the search engine decide which page is better.
There is another factor that some search engines use: if a user clicked a page but searched the same/similar query very soon afterwards, it is likely he did not find what he was after. In this case, the search engine can assume the page is not a good fit to the query and reduce the score given to this page.
Other than that, the search engine cannot really know "how much time you played a game" (unless you revisit it multiple times by re-searching the query rather than navigating to it directly, in which case it can use the number of times the user reached the game through search).
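As a rough illustration of that re-query signal (a made-up heuristic over a hypothetical click log, not any search engine's actual formula): treat a click that is followed by a similar query within a short window as a dissatisfied click, and compute a per-URL dissatisfaction rate.

    # Toy "pogo-sticking" signal: a click followed by a similar query within
    # REQUERY_WINDOW seconds counts against the clicked result.
    # The log format (user, timestamp, kind, query, url) is hypothetical.
    REQUERY_WINDOW = 60  # seconds; arbitrary choice

    def similar(q1, q2):
        a, b = set(q1.lower().split()), set(q2.lower().split())
        return bool(a & b) and len(a & b) / len(a | b) >= 0.5

    def dissatisfied_click_rate(events):
        """events: list of (user, timestamp, kind, query, url), kind is 'query' or 'click'."""
        bad, total, by_user = {}, {}, {}
        for e in sorted(events, key=lambda e: e[1]):
            by_user.setdefault(e[0], []).append(e)
        for user_events in by_user.values():
            for i, (user, t, kind, query, url) in enumerate(user_events):
                if kind != "click":
                    continue
                total[url] = total.get(url, 0) + 1
                for _, t2, kind2, query2, _ in user_events[i + 1:]:
                    if t2 - t > REQUERY_WINDOW:
                        break
                    if kind2 == "query" and similar(query, query2):
                        bad[url] = bad.get(url, 0) + 1
                        break
        return {url: bad.get(url, 0) / n for url, n in total.items()}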
Search engines rank results with algorithms like PageRank, which scores websites in their database by how many other sites link to them.
From Wikipedia: PageRank is an algorithm "that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of 'measuring' its relative importance within the set."
As more sites link to a page, its reputation is assumed to rise, and thus its ranking.
Other methods can be used as well: search engines can also detect how much time is spent on a website through their external services. For example, Google can track time spent via its widely used tracking/web-statistics service, Google Analytics.
As the other answers mention, another way of detecting a site's relevance to the query is whether a similar search is conducted within a short time frame of the previous one. This can indicate whether the user actually found what they were looking for on the previous site.

Algorithms/Techniques for rating website (PageRank aside)

I'm looking for algorithms/techniques that are able to assess the importance of a single webpage. Leaving PageRank aside, are there any other methods of doing such a rating based on content, structure and the hyperlinks between pages?
I'm not only talking about the connection from www.foo.com to www.bar.com, as PageRank does, but also from www.foo.com/bar to www.foo.com/baz and so on (beside the question of adapting PageRank for these needs).
How do I "define" importance: I think of importance in this context as "how relevant this page is to the user, as well as how important it is to the rest of the site".
E.g. a Christmas raffle announced on the start page, with only a single link leading to its page, is more important to the user as well as to the site. An imprint page, which has a link from every page (since it usually sits somewhere in the footer), is not important although it has many links to it. The imprint is also not important to the site as a "unit", since it doesn't add any real value toward the site's purpose (giving information, selling products, a general service, etc.).
There is also SALSA, which is more stable than HITS [so it suffers less from spam].
Since you are also interested in the context of pages, you might want to have a look at Haveliwala's work on topic-sensitive PageRank.
Another famous algorithm is Hubs and Authorities (HITS). Basically, you score each page both as a hub (a page having a lot of outbound links) and as an authority (a page having a lot of inbound links).
But you should really define what you mean by importance. What does "important" really mean? PageRank defines it with respect to inbound links; that is PageRank's definition.
If you define important as having a photo, because you like photography, then you could come up with an importance metric like the number of photos on the page. Another metric could be the number of inbound links from a photography site (like flickr.com, 500px, ...).
Using your definition of important, you could use 1 - (the number of inbound links divided by the number of pages on the site). This gives you a number between 0 and 1: 0 means not important and 1 means important.
Using this metric, your imprint, which appears on all the pages of the site, has an importance of 0. Your Christmas sale page, which has only one link to it, has an importance of almost 1.
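For completeness, a minimal sketch of the HITS hub/authority iteration mentioned above, run on a toy in-memory link graph (the graph and the fixed iteration count are only illustrative):

    # Toy HITS iteration: authority(p) = sum of hub scores of pages linking to p;
    # hub(p) = sum of authority scores of the pages p links to; normalize each round.
    def hits(links, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            auth = {p: sum(hub[q] for q, targets in links.items() if p in targets)
                    for p in pages}
            norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            auth = {p: v / norm for p, v in auth.items()}
            hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
            norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            hub = {p: v / norm for p, v in hub.items()}
        return hub, auth

    # hypothetical three-page graph
    hub, auth = hits({"a": ["b", "c"], "b": ["c"], "c": ["a"]})

Note that, like plain inbound-link counting, this would still score the imprint page highly, which is exactly the behaviour the question wants to avoid.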

Is This Idea for Loading Online Content in Bulk Feasible?

I devised an idea a long time ago and never got around to implementing it, and I would like to know whether it is practical in that it would work to significantly decrease loading times for modern browsers. It relies on the fact that related tasks are often done more quickly when they are done together in bulk, and that the browser could be downloading content on different pages using a statistical model instead of being idle while the user is browsing. I've pasted below an excerpt from what I originally wrote, which describes the idea.
Description.
When people visit websites, I conjecture that a probability density function P(q, t), where q is an integer representing the ID of a website and t is a non-negative value representing the time of day, can predict the sequence of webpages visited by the typical human accurately enough to warrant requesting and loading the HTML documents the user is going to read in advance.
For a given website, let the document which appears to be the "main page" of the website, through which users access the other sections, be represented by the root of a tree structure. The probability that the user will visit the root node of the tree can be handled in two ways. If the user wishes to allow a process to execute automatically when the operating system starts, in order to pre-fetch webpages from websites (using a process elaborated later) which the user frequently accesses upon opening the web browser, then the probability function which determines whether a given website will have its webpages pre-fetched can be determined using a self-adapting heuristic model based on the user's history (or by manual input). Otherwise, if no such process is desired by the user, the value of P for the root node is irrelevant, since the pre-fetching process is only used after the user visits the main page of the website.
Children in the tree described earlier are each associated with an individual probability function P(q, t) (this function can be a lookup table which stores time-webpage pairs). Thus, the sequences of webpages the user visits over time are logged using this tree structure. For instance, at 7:00 AM, there may be a 71/80 chance that I visit the "WTF" section on Reddit after loading the main page of that site.
Based on the values of the probability function P for each node in the tree, chains of webpages extending a certain depth from the root node, where the net probability that each sequence is followed, P_c, is past a certain threshold, P_min, are requested upon the user visiting the main page of the site. If the downloading of one webpage finishes before another is processed, a thread pool is used so that another core is assigned the task of parsing the next webpage in the queue of webpages to be parsed. Hopefully, in this manner, a large portion of the webpages the user clicks may be displayed much more quickly than they would be otherwise.
I left out many details and optimizations since I just wanted this to be a brief description of what I was thinking about. Thank you very much for taking the time to read this post; feel free to ask any further questions if you have them.
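To make the chain-selection step concrete, here is a minimal sketch of the mechanism described above (the nested-dictionary model and the threshold value are made-up examples): starting from the page the user just opened, walk the tree of observed next-clicks for the current hour, multiply the conditional probabilities along each path, and queue every page whose cumulative probability stays above P_min.

    # Sketch of the chain selection: collect pages reachable from the current page
    # whose cumulative probability (product of per-hop P(child | parent, hour))
    # stays at or above P_MIN. The nested-dict model and 0.2 threshold are made up.
    P_MIN = 0.2

    def pages_to_prefetch(model, page, hour, p_chain=1.0, max_depth=3):
        """model: {parent: {hour: {child: probability}}} learned from browsing history."""
        if max_depth == 0:
            return []
        chains = []
        for child, p in model.get(page, {}).get(hour, {}).items():
            p_next = p_chain * p
            if p_next >= P_MIN:
                chains.append((child, p_next))
                chains.extend(pages_to_prefetch(model, child, hour, p_next, max_depth - 1))
        return chains

    # e.g. at 7 AM there is a 71/80 chance of going to the "WTF" section from Reddit's main page
    model = {"reddit.com": {7: {"reddit.com/r/wtf": 71 / 80, "reddit.com/r/pics": 0.05}}}
    print(pages_to_prefetch(model, "reddit.com", hour=7))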
Interesting idea -- and there have been some implementations of pre-fetching in browsers, though without the brains you propose -- which could help a lot. I think there are some flaws with this plan:
a) web page browsing, in most cases, is fast enough for most purposes.
b) bandwidth is becoming metered -- if I'm just glancing at the home page, do I as a user want to pay to fetch the other pages? Moreover, in the cases where this sort of thing could be useful (e.g. a slow 3G connection), bandwidth tends to be more tightly metered, and perhaps not so good at concurrency (e.g. CDMA 3G connections).
c) from a server operator's point of view, I'd rather just serve requested pages in most cases. Rendering pages that never get seen costs me cycles and bandwidth. If you are like a lot of folks and on some cloud computing platform, you are paying by the cycle and the byte.
d) it would require re-building lots of analytics systems, many of which still operate on the theory that request == impression
Or, the short summary is that there really isn't a need to presage what people would view in order to speed up serving and rendering pages. Now, where something like this could be really useful is the "hey, if you liked X you probably liked Y" case, popping up links to said content (or products) for folks.
Windows does the same thing with disk access - it "knows" that you are likely to start, say, Firefox at a certain time and preloads it.
SuperFetch also keeps track of what times of day those applications are used, which allows it to intelligently pre-load information that is expected to be used in the near future.
http://en.wikipedia.org/wiki/Windows_Vista_I/O_technologies#SuperFetch
Pointing out existing tech that does similar thing:
RSS readers load feeds in the background, on the assumption that the user will want to read them sooner or later. There's no probability function that selects feeds for download, though; the user explicitly selects them.
Browser start pages and pinned tabs: these load as you start your browser; again, the user gets to select which websites are worth having around all the time.
Your proposal boils down to predicting where the user is most likely to click next, given the current website and current time of day. I can think of a few other factors that play a role here:
what other websites are open in tabs ("user opened song in youtube, preload lyrics and guitar chords!")
what other applications are running ("user is looking at invoice in e-mail reader, preload internet bank")
which person is using the computer -- use the webcam to recognize faces and know which sites each person frequents

Google PageRank: Does it count per domain or per webpage

Is the Google PageRank calculated as one value for a whole website (domain) or is it computed for every single webpage?
How closely Google follows the publicly known PageRank algorithm is their trade secret. In the generic algorithm, PageRank is calculated per document.
Edit: the original, generic PageRank is explained at http://www.ams.org/featurecolumn/archive/pagerank.html
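For reference, the generic per-document computation from that article can be sketched as a simple power iteration over the link graph (the toy graph, damping factor, and iteration count below are the usual textbook choices, not Google's production system):

    # Generic PageRank by power iteration: every document gets its own score.
    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page (document, not domain) to its outlinks."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / n for p in pages}
            for page in pages:
                targets = links.get(page, [])
                if not targets:                     # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:
                    for t in targets:
                        new_rank[t] += damping * rank[page] / len(targets)
            rank = new_rank
        return rank

    print(pagerank({"a.com/x": ["a.com/y", "b.com"], "a.com/y": ["b.com"], "b.com": ["a.com/x"]}))

Note that each URL is a separate node here, which is what "per document" means in practice.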
Google PageRank
Here is a snippet and the link to an explanation from Google themselves:
http://www.google.com/corporate/tech.html
PageRank Technology: PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.
PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page's importance.
per webpage.
It should be per web page.

Resources