Notes on Graphs and PageRank
The next chapter (chapter 11) of Algorithms for Dummies is all about web search, which predictably means Google. We're still in the graphs section of the book because you can think of the web as an interconnected series of nodes and edges/arcs, just like social graphs. The nodes would be the web pages (or maybe web domains?) and the hyperlinks the arcs. I imagine the edges or connections per page would be far sparser than those found in social graphs, given the sheer number of web pages.
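To make the graph idea concrete, here's a tiny made-up link structure (my own toy example, not from the book) with pages as nodes and hyperlinks as directed edges; it also shows how the incoming links can be recovered from the outgoing ones:

```python
# A toy picture of "the web as a graph": each key is a page (node) and each
# list holds the pages it links out to (directed edges/arcs).
web_graph = {
    "pageA": ["pageB", "pageC"],
    "pageB": ["pageC"],
    "pageC": ["pageA"],
    "pageD": ["pageC", "pageA"],
}

# Outgoing links are just the adjacency list; incoming links take a little work.
incoming = {page: [] for page in web_graph}
for page, links in web_graph.items():
    for target in links:
        incoming[target].append(page)

print(incoming)  # e.g. pageC is linked to by pageA, pageB, and pageD
```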
As I speculated in my last post on this book, the chapter goes over web search at a very high level, so I won't be able to develop an algorithm to do web searching (and why would I? I have Google). Most of the book's discussion centered on Google's PageRank and on the theory of how to rank a page's relevance to a query.
As usual, the book provides an algorithm or two that perform the same function as PageRank.
Theory behind PageRank
In the early days, search engines used keywords found throughout web pages or in their titles as a key determinant of whether a page would be relevant to a query, but unfortunately, keywords can easily be gamed. Black hat SEOs – those working to figure out the rules and manipulate them – and others would use keyword stuffing or "invisible text" (an invisible form of keyword stuffing) to fool search engines into showing their pages. PageRank was an effort to get around these forms of web spam. PageRank uses hyperlinks to determine the quality, or authority, of a page. These links can be both incoming links and outgoing links. Keyword matches are still a prerequisite but are an insufficient determinant of the quality of a web page.
I imagine, although the book doesn't really say it, that the quality of the incoming links is important too. You wouldn't want spammers or poor-quality sites to be able to inflate a page's rank just by piling links onto it.
I think the authors provided Python programs that mimic what Google was initially doing with PageRank; Google's algorithm today is much more sophisticated. From the programs provided, it looks like the first part of the program "initializes" the ranking by setting up a table with the nodes (websites) across the columns and the outgoing connections down the rows. So, if a website makes only two outgoing connections, then the two rows corresponding to the nodes (websites) it links to will get a numerical value. This initial table is basically a table of the probabilities that a user will follow each outgoing link: for 2 connections, it's 50/50; for 3 connections, it's 1/3 each; for 4 connections, 1/4 each. The authors call this table the transition matrix.
This first part of the program also provides an initial PageRank in the form of a vector, and it is basically the simple probability of a user visiting each website. If there are 5 websites in a particular dataset, then the probability would be 1/5 for each website. This is just an initial assignment.
Here's an example, along the lines of the one provided in the book, so you can see better what I'm trying to say:
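(The sketch below is my own small made-up five-page network, using numpy for the matrix math; it's not the book's exact numbers or code, just the setup described above, with each column splitting a page's probability evenly among its outgoing links.)

```python
import numpy as np

# A made-up network of 5 pages; each list is the page's outgoing links.
pages = ["A", "B", "C", "D", "E"]
links = {
    "A": ["B", "C"],          # 2 outgoing links -> 1/2 each
    "B": ["C", "D", "E"],     # 3 outgoing links -> 1/3 each
    "C": ["A"],               # 1 outgoing link  -> probability 1
    "D": ["A", "C", "E"],
    "E": ["A", "D"],
}

n = len(pages)
index = {p: i for i, p in enumerate(pages)}

# Transition matrix: column j holds the probabilities of leaving page j
# for each destination page (the rows). Each column sums to 1.
transition = np.zeros((n, n))
for page, outgoing in links.items():
    for target in outgoing:
        transition[index[target], index[page]] = 1.0 / len(outgoing)

# Initial PageRank vector: every page starts with probability 1/n.
pagerank = np.full(n, 1.0 / n)

print(transition)
print(pagerank)   # [0.2 0.2 0.2 0.2 0.2]
```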
Using matrix multiplication, i.e., the dot product of the transition matrix and the PageRank vector (one of those mathematical matrix operations), one can calculate a new PageRank. The program does this iteratively, multiplying the transition matrix by each successive PageRank estimate, until the estimates start to stabilize. The example Python code shown in the book used 50 iterations.
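Continuing my sketch above (and reusing its `transition`, `pagerank`, and `pages` variables), the iteration is just a loop of dot products; the book used 50 rounds, and here I also stop early once the estimates barely change:

```python
# Repeatedly apply the transition matrix to the rank vector.
for step in range(50):
    new_pagerank = transition.dot(pagerank)
    if np.abs(new_pagerank - pagerank).sum() < 1e-8:
        print(f"stabilized after {step + 1} iterations")
        pagerank = new_pagerank
        break
    pagerank = new_pagerank

print(dict(zip(pages, pagerank.round(3))))
```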
There is an additional feature in the PageRank computation called teleporting, where the simulated user randomly jumps to another page in the network (algorithmically), just to mimic users getting bored and going elsewhere. The reason for this twist stems from the actions of spammers who create dead ends (or rank sinks), pages that capture the user with no way out, and spider traps (cycles), where you just cycle round and round the same area of the network. The fix is the introduction of an alpha value that represents the probability that the user will keep following links rather than teleport elsewhere. It sounds like an additional simulation of random behavior. The suggested alpha value was said to be 0.8 to 0.9.
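Again continuing my own sketch (same `transition`, `n`, and `pages` as before, with an alpha I picked from the suggested 0.8 to 0.9 range), the damped version of the iteration looks roughly like this:

```python
# Teleporting: with probability alpha the user follows a link,
# with probability (1 - alpha) they jump to a random page instead.
alpha = 0.85                              # within the book's suggested 0.8-0.9 range

pagerank = np.full(n, 1.0 / n)            # start over from the uniform vector
teleport = np.full(n, 1.0 / n)            # uniform random jump

for step in range(50):
    pagerank = alpha * transition.dot(pagerank) + (1 - alpha) * teleport

print(dict(zip(pages, pagerank.round(3))))
```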
Other Uses of PageRank
There are other uses for PageRank; it looks like a lot of them are in the scientific and computing fields. The algorithm can also be used for fraud detection or to provide product recommendations. Of course, it has to be altered to suit those needs.
Today, Google includes many more algorithms in its drive to improve search relevance. The tech company is now including AI in its page ranking/search efforts and seems to have progressed beyond keywords into semantics, i.e., understanding the meaning behind the request and behind the contents of the page. RankBrain utilizes AI and machine learning to power about 15% of the queries.
Closing This Post: There are other things to consider when doing web search
Ranking pages is not the only thing going on with web search. I'm guessing there is a lot of upfront work to be done before one can rank a page's relevance to a query. First of all, you have to know what web pages are out there, and that's the job of the spiders and bots. Prior chapters talked about navigating through graphs and noting which nodes have already been visited. Do we search BFS (breadth-first search) or DFS (depth-first search)? Then there is the question of how we hold this information: do we use trees, heap trees, or hash functions? What method of sorting or storing will allow the fastest return of the answer? The chapter didn't talk about these issues, but I'm inferring them because before you can rank the pages, you have to find them and figure out how you are going to store your information for retrieval.
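As a rough picture of the crawling part, here's a toy breadth-first traversal over a made-up, in-memory link graph (my own sketch, not from the book; a real spider would be fetching pages over HTTP and parsing out the links):

```python
from collections import deque

# A pretend site: each page maps to the pages it links to.
link_graph = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post1", "post2"],
    "post1": ["blog", "about"],
    "post2": ["blog"],
}

def bfs_crawl(start):
    """Visit every reachable page level by level, noting what's already seen."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for neighbor in link_graph.get(page, []):
            if neighbor not in visited:   # skip nodes we've already queued
                visited.add(neighbor)
                queue.append(neighbor)
    return order

print(bfs_crawl("home"))   # ['home', 'about', 'blog', 'post1', 'post2']
```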
And then there's the constant battle against the web scammers, so Google has to keep changing its ranking algorithms to stymie those bad guys.
So PageRank is not the only element in the contest to conquer web search.