Google PageRank: Does it count per domain or per webpage?

Is the Google PageRank calculated as one value for a whole website (domain) or is it computed for every single webpage?

How closely Google follows the publicly known PageRank algorithm is a trade secret. In the generic algorithm, PageRank is calculated per document.
Edit: the original, generic PageRank algorithm is explained at http://www.ams.org/featurecolumn/archive/pagerank.html

Google PageRank
Here is a snippet and the link to an explanation from Google themselves:
http://www.google.com/corporate/tech.html
PageRank Technology: PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.

PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page's importance.

per webpage.

It should be per web page.

Related

Detecting duplicate webpages among large number of URLs

From a quote on the Google blog:
"In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates . . . "
How does Google detect those exact duplicate webpages or documents? Any idea which algorithm Google uses?
According to http://en.wikipedia.org/wiki/MinHash:
A large scale evaluation has been conducted by Google in 2006 [10] to compare the performance of Minhash and Simhash [11] algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling [12] and using Minhash and LSH for Google News personalization. [13]
A search for Simhash turns up these pages:
https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/
https://github.com/leonsim/simhash
which reference a paper written by Google employees: "Detecting near-duplicates for web crawling"
Abstract:
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
Another Simhash paper:
http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
Possible solutions:

Exact methods:
1) Brute force: compare every new page to all visited pages (very slow and inefficient).
2) Calculate a hash of every visited page (MD5, SHA-1), store the hashes in a database, and look up every new page's hash in the database.
3) The standard Boolean model of information retrieval (BIR).
... many other possible methods

Near-exact methods:
1) Fuzzy hashing
2) Latent semantic indexing
...
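To make the Simhash idea mentioned above concrete, here is a minimal sketch, not Google's implementation: it hashes word-level features to 64 bits (MD5 is used only for illustration), builds a fingerprint from the per-bit weights, and compares fingerprints by Hamming distance. The threshold k = 3 is taken from the crawling paper purely as an example.

```python
import hashlib
import re

def simhash64(text):
    """Minimal Simhash: 64-bit fingerprint built from word-level features."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Hash each feature to 64 bits (MD5 here is only for illustration).
        h = int(hashlib.md5(token.encode()).hexdigest()[:16], 16)
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

doc_a = "cheap flights to paris book now advertisement block one"
doc_b = "cheap flights to paris book now advertisement block two"
# Near-duplicates land close together in Hamming distance; k=3 is only
# an example threshold, not Google's actual setting.
print(hamming_distance(simhash64(doc_a), simhash64(doc_b)) <= 3)
```

Exact-hash detection (method 2 above) is the same pattern with the full MD5/SHA-1 digest and equality instead of Hamming distance.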

How do search engines evaluate the usefulness of their results?

A search engine returns 2 results A and B for search "play game XYZ".
People who click on result B spend a much longer time and play a lot more XYZ games at site B, while clickers to site A leave the site after a short visit.
I'd imagine site B is a better search result than site A even though it's possible that site B is new and has few links to it in comparison with A. I assume better search engines would take this into account.
My question is, if so, how do they keep track of usage pattern of a search result in the real world?
There are two issues here:
If a user plays game B a lot, he is likely to write about it and link to it (blogs, reviews, social networks, ...). If he does, the static score of B will rise. This is part of the PageRank algorithm, which gives each page a static score and helps the search engine decide which page is better.
There is another factor that some search engines use: if a user clicks a page but searches the same/similar query very soon afterwards, it is likely he did not find what he was after. In that case, the search engine can assume the page is not a good fit for the query and reduce the score given to this page.
Other than that, the search engine cannot really know how much time you spent playing a game (unless you revisit it multiple times by re-issuing the query rather than navigating to it directly, in which case it can use the number of times the user reached the game through search).
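As a toy sketch of that second signal (a quick re-search of the same query counting against the clicked result), here is one possible heuristic. The log format, the 600-second window, and the penalty counter are all assumptions for illustration; no search engine documents its actual rules.

```python
from datetime import datetime, timedelta

# Hypothetical click log: (user, query, clicked_url, timestamp).
log = [
    ("u1", "play game xyz", "siteA.example/xyz", datetime(2013, 1, 1, 10, 0, 0)),
    ("u1", "play game xyz", "siteB.example/xyz", datetime(2013, 1, 1, 10, 2, 0)),
]

REQUERY_WINDOW = timedelta(seconds=600)  # assumed window
penalty = {}

# If the same user issues the same query again soon after clicking a result,
# treat the earlier click as "did not find what they wanted".
for i, (user, query, url, t) in enumerate(log):
    for user2, query2, _, t2 in log[i + 1:]:
        if user2 == user and query2 == query and t2 - t <= REQUERY_WINDOW:
            penalty[url] = penalty.get(url, 0) + 1

print(penalty)  # {'siteA.example/xyz': 1}
```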
Search engines get results with algorithms like PageRank, which sorts the websites in its database by how many sites link to them.
From Wikipedia:
that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set.
As more sites link to it, its reputation is assumed to rise, and thus its ranking.
Other methods can be used as well: search engines can also detect how much time is spent on a website through their external services. For example, Google tracks time spent through its widely used tracking/web-stats service, Google Analytics.
As the other answers mention, another method of detecting a site's relevance to the query is checking whether a similar search is conducted within a short time-frame of the previous one. This can indicate whether the user actually found what they were looking for on the previous site.

Algorithms/Techniques for rating website (PageRank aside)

I'm looking for algorithms/techniques that are able to express the importance of a single webpage. Leaving PageRank aside, are there any other methods of doing such a rating based on content, structure, and the hyperlinks between pages?
I'm not only talking about the connection from www.foo.com to www.bar.com as PageRank does, but also from www.foo.com/bar to www.foo.com/baz and so on (besides the fact of adapting PageRank for these needs).
How do I "define" importance: I think of importance in this context as "how relevant this page is to the user, as well as how important it is to the rest of the site".
E.g. a Christmas raffle announced on the start page, with only a single link leading to it, is more important to the user as well as to the site. An imprint, which has a link from every page (since it's usually somewhere in the footer), is not important although it has many links to it. The imprint is also not important to the site as a "unit", since it doesn't add any real value towards the site's purpose (giving information, selling products, a general service, etc.).
There is also SALSA, which is more stable than HITS (so it suffers less from spam).
Since you are also interested in the context of pages, you might want to have a look at Haveliwala's work on topic-sensitive PageRank.
Another famous algorithm is Hubs and Authorities (HITS). Basically you classify each page as a hub (a page having a lot of outbound links) or an authority (a page having a lot of inbound links).
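A minimal power-iteration sketch of HITS on a made-up link graph (the URLs, graph, and iteration count are illustrative assumptions, not a production implementation):

```python
# Toy directed link graph: page -> pages it links to (made-up example).
links = {
    "foo.com/bar": ["foo.com/baz", "foo.com/raffle"],
    "foo.com/baz": ["foo.com/raffle"],
    "foo.com/raffle": [],
}

hubs = {p: 1.0 for p in links}
auths = {p: 1.0 for p in links}

for _ in range(20):  # a handful of power iterations is enough for a toy graph
    # Authority score: sum of hub scores of pages linking in.
    auths = {p: sum(hubs[q] for q in links if p in links[q]) for p in links}
    norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
    auths = {p: v / norm for p, v in auths.items()}
    # Hub score: sum of authority scores of pages linked out to.
    hubs = {p: sum(auths[q] for q in links[p]) for p in links}
    norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
    hubs = {p: v / norm for p, v in hubs.items()}

print(sorted(auths.items(), key=lambda kv: -kv[1]))
```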
But you should really define what you mean by importance. What does "important" really mean? PageRank defines it with respect to inbound links; that is PageRank's definition.
If you define important as having a photo, because you like photography, then you could come up with an importance metric like the number of photos on the page. Another metric could be the number of inbound links from a photography site (like flickr.com, 500px, ...).
Using your definition of important, you could use `1 - (the number of inbound links divided by the number of pages on the site)`. This gives you a number between 0 and 1, where 0 means not important and 1 means important.
Using this metric, your imprint, which is linked from all the pages of the site, has an importance of 0. Your Christmas raffle page, which has only one link to it, has an importance of almost 1.
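The same metric as a tiny worked example; the site size of 200 pages is made up:

```python
def importance(inbound_links, pages_on_site):
    """1 - (inbound links / pages on the site), clamped to [0, 1]."""
    return max(0.0, 1.0 - inbound_links / pages_on_site)

pages_on_site = 200  # made-up site size
print(importance(200, pages_on_site))  # imprint linked from every page -> 0.0
print(importance(1, pages_on_site))    # Christmas raffle page -> 0.995
```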

How does the PageRank algorithm handle links?

We discussed Google's PageRank algorithm in my algorithms class. What we discussed was that the algorithm represents webpages as a graph and puts them in an adjacency matrix, then does some matrix tweaking.
The only thing is that in the algorithm we discussed, if I link to a webpage, that webpage is also considered to link back to me. This seems to make the matrix multiplication simpler. Is this still the way that PageRank works? If so, why doesn't everyone just link to slashdot.com, yahoo.com, and microsoft.com just to boost their page rankings?
If you read the PageRank paper, you will see that links are not bi-directional, at least for the purposes of the PageRank algorithm. Indeed, it would make no sense if you could boost your page's PageRank by linking to a highly valued site.
If you link to a web page, that web page gets its PageRank increased according to your page's PageRank.
It doesn't work the other way around; links are not bidirectional. So if you link to Slashdot, you won't get any increase in PageRank, but if Slashdot links to you, you will.
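A small power-iteration sketch over a directed toy graph shows why: rank flows along links, not back against them, so a page with no inbound links gains nothing by linking out to big sites. The graph is made up; the damping factor 0.85 is the value from the original PageRank paper.

```python
# Toy directed web graph: page -> outbound links (made-up example).
graph = {
    "me.example": ["slashdot.org", "yahoo.com", "microsoft.com"],
    "slashdot.org": ["yahoo.com"],
    "yahoo.com": ["microsoft.com"],
    "microsoft.com": ["yahoo.com"],
}

d = 0.85                      # damping factor from the original paper
n = len(graph)
rank = {p: 1.0 / n for p in graph}

for _ in range(50):           # power iteration until it roughly converges
    new_rank = {p: (1 - d) / n for p in graph}
    for page, outlinks in graph.items():
        share = rank[page] / len(outlinks) if outlinks else 0
        for target in outlinks:
            new_rank[target] += d * share
    rank = new_rank

# "me.example" receives no inbound links, so its rank stays at the minimum
# no matter how many big sites it links to.
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```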
It's a mystery beyond what we know about the beginnings of BackRub and the paper that avi linked.
My favorite (personal) theory involves lots and lots of hamsters with wheel revolutions per minute heavily influencing the rank of any particular page. I don't know what they give the hamsters .. probably something much milder than LSD.
See the paper "The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google"
http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pdf

What searching algorithm/concept is used in Google?

What searching algorithm/concept is used in Google?
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Indexing
If you want to get down to basics:
Google uses an inverted index of the Internet. What this means is that Google has an index of all pages it's crawled based on the terms in each page. For instance the term Google maps to this page, the Google home page, and the Wikipedia article for Google, amongst others.
Thus, when you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Internet and finds the entry for the term "Google" and with it the list of all pages that have that term referenced in it.
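A minimal sketch of such an inverted index (the documents and the whitespace/word tokenization are toy assumptions):

```python
from collections import defaultdict
import re

# Toy crawled documents (made-up contents).
docs = {
    "google.com": "google search the web",
    "en.wikipedia.org/wiki/Google": "google is a search company",
    "example.com": "an unrelated page about cats",
}

# Build the inverted index: term -> set of pages containing it.
index = defaultdict(set)
for url, text in docs.items():
    for term in re.findall(r"\w+", text.lower()):
        index[term].add(url)

# Answering the query "google" is then a single lookup.
print(index["google"])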
For veteran users:
Google's index goes beyond your simple inverted index, however. This is why Google is the best. Google's crawlers (spiders) are smart. Very smart. Beyond just keeping track of the terms that are on any given web page, they also keep track of words that are on related pages and link those to the given document.
In other words, if a page has the term Google in it and the page has a link to or is linked from another web page, the other page may be referenced in the index under the term Google as well. All this and more go into why a given page is returned for a given query.
If you want to go into why pages are ordered the way they are in your search results, that gets into even more interesting stuff.
Ranking
To get down to basics:
Perhaps one of the most basic algorithms a search engine can use to sort your results is known as term frequency-inverse document frequency (tf-idf). Simply put, this means that your results will be ordered by the relative importance of your search terms in the document. In other words, a document that has 10 pages and lists the word Google once is not nearly as important as a document that has 1 page and lists the word Google ten times.
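A bare-bones tf-idf scoring sketch of that idea; the toy corpus and the smoothed idf formula are assumptions, not what any real engine uses:

```python
import math
import re
from collections import Counter

# Toy corpus (made-up documents): a short page that mentions Google ten
# times versus a long page that mentions it once.
docs = {
    "short": "google " * 10,
    "long": ("filler text " * 500) + "google ",
}

def tf_idf(term, doc_text, all_docs):
    tokens = re.findall(r"\w+", doc_text.lower())
    tf = Counter(tokens)[term] / len(tokens)
    df = sum(1 for text in all_docs.values()
             if term in re.findall(r"\w+", text.lower()))
    idf = math.log(len(all_docs) / (1 + df)) + 1   # smoothed idf (an assumption)
    return tf * idf

for name, text in docs.items():
    print(name, tf_idf("google", text, docs))  # "short" scores far higher
```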
For veteran users:
Again, Google does quite a bit more than your basic search engine when it comes to ranking results. Google has implemented the aforementioned, patented PageRank algorithm. In short, PageRank enhances the tf-idf approach by taking into account the popularity/importance of a given page. At this point, popularity/importance may be judged by any number of factors that Google just won't tell us. However, at the most basic of levels, Google can tell that one page is more important than another because loads and loads of other pages link to it.
Google's patented PigeonRank™
Wow, they initially posted this 7 years ago from Wednesday ...
PageRank is a link analysis algorithm used by Google for the search engine, but the patent was assigned to Stanford University.
I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" is a little outdated.
Hier a recent talk about scalability: Challenges in Building Large-Scale Information Retrieval Systems
An inverted index and MapReduce are the basics of most search engines (I believe). You create an index on the content and run queries against that index to determine relevance. Google, however, does much more than just a simple index of where each word occurs: it also records how many times a word appears, where it appears, where it appears in relation to other words, the ordering, and so on. Another simple concept that's used is "stop words", which may include things like "and", "the", and so on (basically "simple" words that occur often and are generally not the focus of a query). In addition, they employ things like PageRank (mentioned by TStamper) to order pages by relevance and importance.
MapReduce is basically taking one job and dividing it into smaller jobs, then letting those smaller jobs run on many systems (in part for scalability and in part for speed). If I recall correctly, Google was able to use "average" computers to distribute jobs to instead of server-grade computers. Since the processing capability of a single computer is reaching its peak, much technology is heading towards cloud computing, where a job is done by many physical machines.
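The MapReduce idea, shrunk to a single-process word count; in a real system each map and reduce task would run on a different machine, and the split contents here are made up:

```python
from collections import defaultdict
from itertools import chain

# Each "split" of the input would normally live on a different machine.
splits = ["the quick brown fox", "the lazy dog", "the end"]

def map_phase(text):
    # Map: emit (word, 1) for every word in one split.
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

mapped = chain.from_iterable(map_phase(s) for s in splits)
print(reduce_phase(mapped))  # {'the': 3, 'quick': 1, ...}
```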
I'm not sure how much searching Google does; it's more accurately described as crawling: they just start at specific points, crawl to anything reachable, and repeat until they hit some sort of dead end.
While being interested in the PageRank algorithm and similar ones, I was disturbed to discover that the introduction of personalized search at the turn of the year (not widely commented on) seems to change quite a lot; see "Failure of the Google Gold Standard" and "Google's Personalized Results".
This question cannot be answered canonically. The algorithms used by Google (and other search engines) are among their most closely guarded secrets and change constantly. Every correct answer can be invalid a month or a year later.
(I know this doesn't really answer the question, but that's the point, there is no possible answer.)