PageRank is the technique used by Google to determine importance of page on the web. It considers all incoming links to a page as votes for PageRank. But if these votes from different sources weighted equally, this will lead to wrong result. Thus, votes from different sources are weighted according to PageRank of voting page and number of links that voting page contains. Founders of Google, Sergey Brin and Lawrence Page have defined PageRank by a formula. Researchers have done lot of experiments and came up with some conclusions. There are conflicts among opinions of different researchers. This paper starts with formula for PageRank calculation. It describes how to use the formula. Then, paper goes through some examples and derives some observations from it. Paper also contains basic ideas to improve PageRank of web site.
Today search engines are becoming best friends of most of the people for navigation on Internet. User needs to enter keyword or combination of keywords to trigger the search. Search engine searches the web pages available on Internet and returns the result as number of ordered pages. This order should depend on relevancy of pages and importance of pages.
Different search engines use different techniques to decide importance of page on the web. Alta Vista uses HITS (Hyperlink Induced Topic Search) algorithm to rank pages while Google uses PageRank algorithm to rank the pages.
Yahoo also uses some variation of PageRank algorithm that Google uses. There are many variations of Page Ranking algorithm. Like confidence based page ranking, weighted page rank algorithm, query sensitive self-adaptable web page ranking algorithm etc. But, this paper covers only one algorithm that is based on formula given by Bin and Page, the founders of Google.
Most of the people start their web navigation by search engine. Google is the most famous search engine used now days. While presenting search results, they should be ordered by their relevancy and importance on the web. User cannot go through all the pages presented as output of search. Thus all the pages in the collection should be weighted and represented in the order of their weights.
One of the most important factors that Google uses is PageRank. PageRank is a numeric value that represents how important a page is on the web. Off course PageRank is not the only factor, which decides importance of page, but still it is one of them. PageRank is described by one mathematical formula that seems very difficult at first, but actually it is not.
The citation graph of the web is main resource for calculation of PageRank. In the paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine” founders of Google, Sergey Brin and Lawrence Page defined PageRank as:
“We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85 .……. C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) “
It's the original formula that was published when PageRank was being developed, and it is probable that Google uses a variation of it but they aren't telling us what it is. Thus, when one page links to other page, first page votes some PageRank to the second page. PageRank of a page is the addition of one constant (1-d=0.15) and the damped value of addition of votes by all pages pointing to it. Value of vote by a particular page depends on PageRank and total number of out links of that page. Thus, higher the PageRank, higher is the value of vote and higher the number of out links, lower is the value of vote.
So every page distributes 85% of its original PageRank evenly among all pages to which it points. e.g. if page A is pointing to four other pages then factor ‘d * PR(A)/4’ will come in PageRank equation of all those four pages. That four factors will add up to d * PR(A) , i.e. 85% of PageRank of A.( From here onwards I’ll refer PageRank as PR) Note that when page votes for another page, it doesn’t give anything from its own PR. Its just voting, only the difference is that weight of vote of a page depends on its own PR. It is same as shareholders meeting where weight of vote of shareholder depends on the shares held.
If we look at the equation, it’s obvious that PR of a page depends on PR of other pages, which are pointing to it, which may depend on PR of original page if it has back link to that page. e.g. If there are only two pages A and B which points to each other, then PR of each page will depend on PR of other, which is yet uncalculated. So where to start from? Surprisingly, we can start with any assumed values of PR. And repeat the calculation of PR values of all pages iteratively till they become stable.
Lets see an example: consider there are only two pages A and B, which points to each other. We don’t know what their PR should be to begin with, so let’s take a guess at 1 and do some calculations.
PR(A) = (1-d) + d * PR(B)
= (1-0.85) + 0.85 * 1
PR(B) = (1-d) + d * PR(A)
= (1-0.85) + 0.85 * 1
Since they are not changing we can stop here. Now lets do the calculations with initial guess at 0:
PR(A) = 0.15 + 0.85 * 0
PR(B) = 0.15 + 0.85 * 0.15
After second iteration:
PR(A) = 0.15 + 0.85 * 0.2775
PR(B) = 0.15 + 0.85 + 0.385875
After third iteration:
PR(A) = 0.15 + 0.85 * 0.47799375
PR(B) = 0.15 + 0.85 * 0.5562946875
and so on. They will keep changing till they reach value of 1. If we start with values greater than 1, then also values will degrade iteration by iteration and will settle down to 1. If we start calculation with different values of PR(A) and PR(B), then also we will end up with PR(1) = PR(B) = 1. Here we considered symmetrical link structure between A and B, thus with any value of PR(A) and PR(B) to start, we ends up with PR(A) = PR(B) = 1. If we take asymmetrical structure e.g. only A is pointing to B and B is not pointing to A, then we will end up with consistent values of PR(A) and PR(B), whatever the values we start with.
For more practice with different numbers of pages with different configurations of citation graphs, visit:
Google uses PR as one of the important factor in search process to fine the relevancy of page. So while searching, web pages are accessed in the decreasing order of PR. One can find PR of its own webpage by using Google toolbar (http://toolbar.google.com/). But output of this toolbar is somewhat unexpected. As our normal PR calculation can yield PR ranging from 0.15 to unlimited. But toolbar gives PR of any page in the range 1 to 10. So actual PR values are divided into intervals and are represented as one of the value ranging from 1 to 10. But the question is whether these intervals are equidistant? Means is the scale linear? No one outside Google knows it. For this question, there are different answers from different researchers. According to Ian Rogers, intervals cannot be equidistant. Means the scale used by Google must be logarithmic. This yields a result that efforts needed to increase PR from 2 to 3 are very less as compared to increase PR from 8 to 9. What these efforts are? At the end of this paper anyone can answer this question.