I am a student interested in developing a search engine that indexes pages from my country. I have been researching algorithms to use for sometime now and I have identified HITS and PageRank as the best out there. I have decided to go with PageRank since it is more stable than the HITS algorithm (or so I have read).
I have found countless articles and academic papers related to PageRank, but my problem is that I don't understand most of the mathematical symbols that form the algorithm in these papers. Specifically, I don't understand how the Google Matrix (the irreducible,stochastic matrix) is calculated.
My understanding is based on these two articles:
http://online.redwoods.cc.ca.us/instruct/darnold/LAPROJ/fall2005/levicob/LinAlgPaperFinal2-Screen.pdf
http://ilpubs.stanford.edu:8090/386/1/1999-31.pdf
Could someone provide a basic explanation (examples would be nice) with less mathematical symbols?
Thanks in advance.
The formal defintion of PageRank, as defined at page 4 of the cited document, is expressed in the mathematical equation with the funny "E" symbol (it is in fact the capital Sigma Greek letter. Sigma is the letter "S" which here stands for Summation).
In a nutshell this formula says that to calculate the PageRank of page X...
For all the backlinks to this page (=all the pages that link to X)
you need to calculate a value that is
The PageRank of the page that links to X [R'(v)]
divided by
the number of links found on this page. [Nv]
to which you add
some "source of rank", [E(u)] normalized by c
(we'll get to the purpose of that later.)
And you need to make the sum of all these values [The Sigma thing]
and finally, multiply it by a constant [c]
(this constant is just to keep the range of PageRank manageable)
The key idea being this formula is that all web pages that link to a given page X are adding to value to its "worth". By linking to some page they are "voting" in favor of this page. However this "vote" has more or less weight, depending on two factors:
The popularity of the page that links to X [R'(v)]
The fact that the page that links to X also links to many other pages or not. [Nv]
These two factors reflect very intuitive ideas:
It's generally better to get a letter of recommendation from a recognized expert in the field than from a unknown person.
Regardless of who gives the recommendation, by also giving recommendation to other people, they are diminishing the value of their recommendation to you.
As you notice, this formula makes use of somewhat of a circular reference, because to know the page range of X, you need to know the PageRank of all pages linking to X. Then how do you figure these PageRank values?... That's where the next issue of convergence explained in the section of the document kick in.
Essentially, by starting with some "random" (or preferably "decent guess" values of PageRank, for all pages, and by calculating the PageRank with the formula above, the new calculated values get "better", as you iterate this process a few times. The values converge, i.e. they each get closer and closer to what is the actual/theorical value. Therefore by iterating a sufficient amount of times, we reach a moment when additional iterations would not add any practical precision to the values provided by the last iteration.
Now... That is nice and dandy, in theory. The trick is to convert this algorithm to something equivalent but which can be done more quickly. There are several papers that describe how this, and similar tasks, can be done. I don't have such references off-hand, but will add these later. Beware they do will involve a healthy dose of linear algebra.
EDIT: as promised, here are a few links regarding algorithms to calculate page rank.
Efficient Computation of PageRank Haveliwala 1999 ///
Exploiting the Block Structure of the Web for Computing PR Kamvar etal 2003 ///
A fast two-stage algorithm for computing PageRank Lee et al. 2002
Although many of the authors of the links provided above are from Stanford, it doesn't take long to realize that the quest for efficient PageRank-like calculation is a hot field of research. I realize this material goes beyond the scope of the OP, but it is important to hint at the fact that the basic algorithm isn't practical for big webs.
To finish with a very accessible text (yet with many links to in-depth info), I'd like to mention Wikipedia's excellent article
If you're serious about this kind of things, you may consider an introductory/refresher class in maths, particularly linear algebra, as well a computer science class that deal with graphs in general. BTW, great suggestion from Michael Dorfman, in this post, for OCW's video of 1806's lectures.
I hope this helps a bit...
If you are serious about developing an algorithm for a search engine, I'd seriously recommend you take a Linear Algebra course. In the absence of an in-person course, the MIT OCW course by Gilbert Strang is quite good (video lectures at http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/VideoLectures/).
A class like this would certainly allow you to understand the mathematical symbols in the document you provide-- there's nothing in that paper that wouldn't be covered in a first-year Linear Algebra course.
I know this isn't the answer you are looking for, but it's really the best option for you. Having someone try to explain the individual symbols or algorithms to you when you don't have a good grasp of the basic concepts isn't a very good use of anybody's time.
This is the paper that you need: http://infolab.stanford.edu/~backrub/google.html (If you do not recognise the names of the authors, you will find more information about them here: http://www.google.com/corporate/execs.html).
The symbols used in the document, are described in the document in lay English.
Thanks for making me google this.
You might also want to read the introductory tutorial on the mathematics behind the construction of the Pagerank matrix written by David Austin's entitled How Google Finds Your Needle in the Web's Haystack; it starts with a simple example and builds to the full definition.
"The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google". from Rose-Hulman is a bit out of date, because now Page Rank is the $491B linear algebra problem. I think the paper is very well written.
"Programming Collective Intelligence" has a nice discussion of Page Rank as well.
Duffymo posted the best refernce in my opinion. I studied the page rank algorithm in my senior undergrad year. Page rank is doing the following:
Define the set of current webpages as the states of a finite markov chain.
Define the probability of transitioning from site u to v where the there is an outgoing link to v from u to be
1/u_{n} where u_{n} is the number of out going links from u.
Assume the markov chain defined above is irreducible (this can be enforced with only a slight degradation of the results)
It can be shown every finite irreducible markov chain has a stationary distribution. Define the page rank to be the stationary distribution, that is to say the vector that holds the probability of a random particle to end up at each given site as the number of state transitions goes to infinity.
Google uses a slight variation on the power method to find the stationary distribution (the power method finds dominant eigenvalues). Other than that there is nothing to it. Its rather simple and elegant and probably one of the simplest applications of markov chains I can think of, but it is wortha lot of money!
So all the pagerank algorithm does is take into account the topology of the web as an indication of whether a website should be important. The more incoming links a site has the greater the probability of a random particle spending its time at the site over an infinite amount of time.
If you want to learn more about page rank with less math, then this is very good tutorial on basic matrix operations. I recommend it for everyone who has little math background but wants to dive into ranking algorithms.
Related
The excellent lmfit package lets one to run nonlinear regression. It can report two different conf intervals - one based on the covarience matrix the other using a more sophisticated tecnique based on an F-test. Details can be found on the doc. I would like to understand he reasoning behind this technique in depth. Which topics should i read about? Note: i have sufficient stats knowledge
F stats and other associated methods for obtaining confidence intervals are far superior to a simple estimation of te co variance matrix for non-linear models (and others).
The primary reason for this is the lack of assumptions about the Gaussian nature of error when using these methods. For non-linear systems, confidence intervals can (they don't have to be) be asymmetric. This means that the parameter value can effect the error surface differently and therefore the one, two, or three sigma limits have different magnitudes in either direction from the best fit.
The analytical ultracentrifugation community has excellent articles involving error analysis (Tom Laue, John J. Correia, Jim Cole, Peter Schuck are some good names for article searches). If you want a good general read about proper error analysis, check out this article by Michael Johnson:
http://www.researchgate.net/profile/Michael_Johnson53/publication/5881059_Nonlinear_least-squares_fitting_methods/links/0deec534d0d97a13a8000000.pdf
Cheers!
How do you find an optimum learning rule for a given problem, say a multiple category classification?
I was thinking of using Genetic Algorithms, but I know there are issues surrounding performance. I am looking for real world examples where you have not used the textbook learning rules, and how you found those learning rules.
Nice question BTW.
classification algorithms can be classified using many Characteristics like:
What does the algorithm strongly prefer (or what type of data that is most suitable for this algorithm).
training overhead. (does it take a lot of time to be trained)
When is it effective. ( large data - medium data - small amount of data ).
the complexity of analyses it can deliver.
Therefore, for your problem classifying multiple categories I will use Online Logistic Regression (FROM SGD) because it's perfect with small to medium data size (less than tens of millions of training examples) and it's really fast.
Another Example:
let's say that you have to classify a large amount of text data. then Naive Bayes is your baby. because it strongly prefers text analysis. even that SVM and SGD are faster, and as I experienced easier to train. but these rules "SVM and SGD" can be applied when the data size is considered as medium or small and not large.
In general any data mining person will ask him self the four afomentioned points when he wants to start any ML or Simple mining project.
After that you have to measure its AUC, or any relevant, to see what have you done. because you might use more than just one classifier in one project. or sometimes when you think that you have found your perfect classifier, the results appear to be not good using some measurement techniques. so you'll start to check your questions again to find where you went wrong.
Hope that I helped.
When you input a vector x to the net, the net will give an output depend on all the weights (vector w). There would be an error between the output and the true answer. The average error (e) is a function of the w, let's say e = F(w). Suppose you have one-layer-two-dimension network, then the image of F may look like this:
When we talk about training, we are actually talking about finding the w which makes the minimal e. In another word, we are searching the minimum of a function. To train is to search.
So, you question is how to choose the method to search. My suggestion would be: It depends on how the surface of F(w) looks like. The wavier it is, the more randomized method should be used, because the simple method based on gradient descending would have bigger chance to guide you trapped by a local minimum - so you lose the chance to find the global minimum. On the another side, if the suface of F(w) looks like a big pit, then forget the genetic algorithm. A simple back propagation or anything based on gradient descending would be very good in this case.
You may ask that how can I know how the surface look like? That's a skill of experience. Or you might want to randomly sample some w, and calculate F(w) to get an intuitive view of the surface.
I am working on edge detection in images and would like to evaluate the performance of algorithm, if any any one could give me a reference or method on how to proceed it will be really helpful. :)
I do not have ground truth and data set includes color as well as gray images.
Thank you.
Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc). Calculate stats on your algo vs reference algo performance, in terms of (a) number of points your algo classifies as edge which each reference algo, or the combination, does not classify as edge (false positive), or (b) number of points which the reference algo classifies as edge that your algo does not (false negative). You can also calculate a rank correlation-type number for algos by looking at each point and looking at which algos do (or don't) classify that as an edge.
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
Good luck!
For comparisons, quantitative measures like what #Alex I explained is best. To do so, you need to define what is "correct" with a ground truth set and a way to consistently determine if a given image is correct or on a more granular level, how correct (some number like a percentage) it is. #Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. Usually less desirable as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
When I mean user study I mean to poll people on how well a result is given the input image. You could give them a scale to rate things on and randomly give them samples from both your results and the results of another algorithm
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look here first as they've already gone through this same process and determined what was best for them: google scholar).
I'm working on Markov Chains and I would like to know of efficient algorithms for constructing probabilistic transition matrices (of order n), given a text file as input.
I am not after one algorithm, but I'd rather like to build a list of such algorithms. Papers on such algorithms are also more than welcome, as any tips on terminology, etc. Notice that this topic bears a strong resemblance with n-gram identification algorithms.
Any help would be much appreciated.
It sounds like there are two possible questions, you should clarify which one:
The 'text file' contains probability values and "n" and you build the matrix directly, but how to code it? This question is trivial, so let's disregard it
The 'text file' contains something like signal data and you want to model it as a Markov Chain.
'Markov Chain' generally refers to a first order stochastic process, so I'm not sure then what you mean by "order", probably the size of the matrix, but that is not typical terminology. Anyway, for 1st-order, n x n matrix, discrete time random process, you should look at Viterbi Algorithm: http://en.wikipedia.org/wiki/Viterbi_algorithm
Whenever dealing with Markov Models, I tend to end up looking at crm114 Discriminator. One, he goes into great detail about what different models there actually are (Markov isn't always the best, depending on what the application is) and provides general links and lots of background information on how probabilistic models work. While crm114 is generally used as some sort of SPAM identification tool, it is actually a more generic probability engine that I have used in other applications.
We discussed Google's PageRank algorithm in my algorithms class. What we discussed was that the algorithm represents webpages as a graph and puts them in an adjacency matrix, then does some matrix tweaking.
The only thing is that in the algorithm we discussed, if I link to a webpage, that webpage is also considered to link back to me. This seems to make the matrix multiplication simpler. Is this still the way that PageRank works? If so, why doesn't everyone just link to slashdot.com, yahoo.com, and microsoft.com just to boost their page rankings?
If you read the PageRank paper, you will see that links are not bi-directional, at least for the purposes of the PageRank algorithm. Indeed, it would make no sense if you could boost your page's PageRank by linking to a highly valued site.
If you link to the web page, that web page gets it's pagerank number increased according to your site page rank.
It doesn't work the other way around. Links are not bidirectional. So if you link to slashdot, you won't get any increase in pagerank, if slashdot links to you, you will get increase in pagerank.
Its a mystery beyond what we know about the beginnings of backrub and the paper that avi linked.
My favorite (personal) theory involves lots and lots of hamsters with wheel revolutions per minute heavily influencing the rank of any particular page. I don't know what they give the hamsters .. probably something much milder than LSD.
See the paper "The 25 Billion dollar eigenvector"
http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pdf